2025-05-07T20:22:34.9474251Z Current runner version: '2.323.0'
2025-05-07T20:22:34.9480282Z Runner name: 'i-00cb9561c833cfdb2'
2025-05-07T20:22:34.9481172Z Machine name: 'ip-10-0-73-154'
2025-05-07T20:22:34.9483974Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:34.9486239Z Contents: read
2025-05-07T20:22:34.9486747Z Metadata: read
2025-05-07T20:22:34.9487238Z Packages: read
2025-05-07T20:22:34.9487715Z ##[endgroup]
2025-05-07T20:22:34.9489588Z Secret source: None
2025-05-07T20:22:34.9490273Z Prepare workflow directory
2025-05-07T20:22:35.0007182Z Prepare all required actions
2025-05-07T20:22:35.0044849Z Getting action download info
2025-05-07T20:22:35.2235599Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.5364156Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:35.9021723Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.5121342Z Getting action download info
2025-05-07T20:22:37.6335534Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:37.8306913Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.9, 12.6.3, 12.6.3, gcc)
2025-05-07T20:22:37.8913616Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:37.9048386Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:37.9061272Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:37.9062810Z ##[endgroup]
2025-05-07T20:22:38.9744665Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:38.9745381Z Instance Type: g5.4xlarge
2025-05-07T20:22:38.9745728Z AMI Name: unknown
2025-05-07T20:22:38.9784632Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.3224232Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.3224548Z with:
2025-05-07T20:22:44.3224778Z   submodules: true
2025-05-07T20:22:44.3225017Z   repository: pytorch/FBGEMM
2025-05-07T20:22:44.3225418Z   token: ***
2025-05-07T20:22:44.3225619Z   ssh-strict: true
2025-05-07T20:22:44.3225834Z   ssh-user: git
2025-05-07T20:22:44.3226052Z   persist-credentials: true
2025-05-07T20:22:44.3226304Z   clean: true
2025-05-07T20:22:44.3226533Z   sparse-checkout-cone-mode: true
2025-05-07T20:22:44.3226798Z   fetch-depth: 1
2025-05-07T20:22:44.3227014Z   fetch-tags: false
2025-05-07T20:22:44.3227232Z   show-progress: true
2025-05-07T20:22:44.3227458Z   lfs: false
2025-05-07T20:22:44.3227665Z   set-safe-directory: true
2025-05-07T20:22:44.3227923Z env:
2025-05-07T20:22:44.3228138Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.3228454Z   BUILD_ENV: build_binary
2025-05-07T20:22:44.3228731Z   BUILD_TARGET: genai
2025-05-07T20:22:44.3228970Z   BUILD_VARIANT: cuda
2025-05-07T20:22:44.3229241Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:44.3229498Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.3229843Z ##[endgroup]
2025-05-07T20:22:44.4384315Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.4385496Z ##[group]Getting Git version info
2025-05-07T20:22:44.4385946Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.4386548Z [command]/usr/bin/git version
2025-05-07T20:22:44.4386812Z git version 2.47.1
2025-05-07T20:22:44.4395353Z ##[endgroup]
2025-05-07T20:22:44.4409287Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/2a7e0901-7173-4864-9b4d-c594ce024a59' before making global git config changes
2025-05-07T20:22:44.4410299Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.4424281Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.4461167Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.4464497Z ##[group]Initializing the repository
2025-05-07T20:22:44.4468684Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.4511712Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.4512380Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.4512907Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.4513273Z hint:
2025-05-07T20:22:44.4513563Z hint:   git config --global init.defaultBranch <name>
2025-05-07T20:22:44.4513960Z hint:
2025-05-07T20:22:44.4514321Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.4514911Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.4515330Z hint:
2025-05-07T20:22:44.4515567Z hint:   git branch -m <name>
2025-05-07T20:22:44.4516040Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.4524627Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.4558784Z ##[endgroup]
2025-05-07T20:22:44.4559234Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.4562977Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.4594318Z ##[endgroup]
2025-05-07T20:22:44.4594696Z ##[group]Setting up auth
2025-05-07T20:22:44.4601195Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.4633725Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.4994686Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.5027328Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.5375614Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.5424752Z ##[endgroup]
2025-05-07T20:22:44.5425152Z ##[group]Fetching the repository
2025-05-07T20:22:44.5432982Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.2977236Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.2977904Z  * [new ref]  a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3001717Z ##[endgroup]
2025-05-07T20:22:45.3002130Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3005051Z ##[endgroup]
2025-05-07T20:22:45.3020685Z [command]/usr/bin/git sparse-checkout disable
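
The auth setup above scrubs any stale core.sshCommand and extraheader entries from the repository and its submodules before injecting a fresh basic-auth header for github.com. A minimal sketch of the same pattern outside the runner (the base64 token value is a placeholder; the real one is masked as *** in this log):

  git config --local http.https://github.com/.extraheader "AUTHORIZATION: basic <BASE64_TOKEN>"  # placeholder credential
  git config --global --add url.https://github.com/.insteadOf "git@github.com:"                  # reroute SSH-style URLs over HTTPS so the header applies
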
2025-05-07T20:22:45.3059157Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:22:45.3098953Z ##[group]Checking out the ref
2025-05-07T20:22:45.3102345Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.4172958Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.4173487Z
2025-05-07T20:22:45.4173967Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.4175022Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.4175564Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.4175878Z
2025-05-07T20:22:45.4176086Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.4176546Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.4176807Z
2025-05-07T20:22:45.4176916Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.4177108Z
2025-05-07T20:22:45.4177234Z Or undo this operation with:
2025-05-07T20:22:45.4177405Z
2025-05-07T20:22:45.4177491Z   git switch -
2025-05-07T20:22:45.4177901Z
2025-05-07T20:22:45.4178138Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.4178460Z
2025-05-07T20:22:45.4178837Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.4184987Z ##[endgroup]
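
The checkout above can be reproduced by hand; the commands below are taken verbatim from this log (PR merge commit a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 fetched into refs/remotes/pull/4066/merge):

  git init FBGEMM && cd FBGEMM
  git remote add origin https://github.com/pytorch/FBGEMM
  git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
  git checkout --progress --force refs/remotes/pull/4066/merge   # lands in detached HEAD, as git warns above
  git -c protocol.version=2 submodule update --init --force --depth=1
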
2025-05-07T20:22:45.4185386Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.4190556Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.4234742Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.4267677Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.4300704Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.4329604Z ##[endgroup]
2025-05-07T20:22:45.4329977Z ##[group]Fetching submodules
2025-05-07T20:22:45.4332465Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.4678933Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.5010105Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.5012342Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.5015452Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.5018745Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.5022177Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.5026600Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.5029821Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.5060511Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.8030356Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.2894699Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:46.6255771Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:47.7416867Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.0659236Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.4136672Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.5516381Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.5516843Z  * branch  e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.5995950Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:51.0981850Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:51.0982331Z  * branch  4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:51.3821398Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:52.1643157Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:52.1643696Z  * branch  6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:52.2749983Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:53.3783687Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:53.3784545Z  * branch  3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:54.0813178Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.9072379Z From https://github.com/google/googletest
2025-05-07T20:22:54.9072837Z  * branch  f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.9482534Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:55.6570347Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:55.6570834Z  * branch  420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:55.6659493Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:56.4234461Z From https://github.com/nlohmann/json
2025-05-07T20:22:56.4234885Z  * branch  9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:56.5352660Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:56.5374175Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:56.5707399Z Entering 'external/asmjit'
2025-05-07T20:22:56.5740424Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.5773230Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.5805657Z Entering 'external/cutlass'
2025-05-07T20:22:56.5839586Z Entering 'external/googletest'
2025-05-07T20:22:56.5872450Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.5904937Z Entering 'external/json'
2025-05-07T20:22:56.5953037Z ##[endgroup]
2025-05-07T20:22:56.5953431Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:56.5959682Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:56.6302747Z Entering 'external/asmjit'
2025-05-07T20:22:56.6374979Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.6444845Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.6514751Z Entering 'external/cutlass'
2025-05-07T20:22:56.6588979Z Entering 'external/googletest'
2025-05-07T20:22:56.6658699Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.6728867Z Entering 'external/json'
2025-05-07T20:22:56.6814940Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:56.7144501Z Entering 'external/asmjit'
2025-05-07T20:22:56.7207474Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:56.7210408Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.7273557Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:56.7275838Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.7336847Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:56.7341021Z Entering 'external/cutlass'
2025-05-07T20:22:56.7401882Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:56.7405609Z Entering 'external/googletest'
2025-05-07T20:22:56.7466259Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:56.7467643Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.7530339Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:56.7532983Z Entering 'external/json'
2025-05-07T20:22:56.7596384Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:56.7711222Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:56.8036981Z Entering 'external/asmjit'
2025-05-07T20:22:56.8069035Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.8101484Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.8133316Z Entering 'external/cutlass'
2025-05-07T20:22:56.8164278Z Entering 'external/googletest'
2025-05-07T20:22:56.8195866Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.8227879Z Entering 'external/json'
2025-05-07T20:22:56.8274736Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.8600140Z Entering 'external/asmjit'
2025-05-07T20:22:56.8631847Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.8663734Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.8695046Z Entering 'external/cutlass'
2025-05-07T20:22:56.8726626Z Entering 'external/googletest'
2025-05-07T20:22:56.8757754Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.8789744Z Entering 'external/json'
2025-05-07T20:22:56.8850169Z ##[endgroup]
2025-05-07T20:22:56.8875729Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.8902747Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:22:56.9095721Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.9096040Z with:
2025-05-07T20:22:56.9096280Z   name: fbgemm_genai_x86_gcc_py3.9_cu12.6.3.whl
2025-05-07T20:22:56.9096598Z   merge-multiple: false
2025-05-07T20:22:56.9096844Z   repository: pytorch/FBGEMM
2025-05-07T20:22:56.9097095Z   run-id: 14891846252
2025-05-07T20:22:56.9097295Z env:
2025-05-07T20:22:56.9097513Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.9097800Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.9098040Z   BUILD_TARGET: genai
2025-05-07T20:22:56.9098254Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.9098487Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.9098727Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.9098951Z ##[endgroup]
2025-05-07T20:22:57.1409049Z Downloading single artifact
2025-05-07T20:22:57.2443255Z Preparing to download the following artifacts:
2025-05-07T20:22:57.2444069Z - fbgemm_genai_x86_gcc_py3.9_cu12.6.3.whl (ID: 3081362189, Size: 12502543, Expected Digest: sha256:b7fa57ec448da168df38f5dcb4b2b6b212acdb815f6a27a8982ddcd3bf673086)
2025-05-07T20:22:57.2877768Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-e6155e83-5447-52ac-883e-059201805a6b/artifacts/563b5055f9a6d043e54aa78b8ff41f43d378d41ef31414d516f024e334c8085c.zip
2025-05-07T20:22:57.2879172Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:57.3490090Z (node:56972) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:57.3491030Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:57.5340175Z SHA256 digest of downloaded artifact is b7fa57ec448da168df38f5dcb4b2b6b212acdb815f6a27a8982ddcd3bf673086
2025-05-07T20:22:57.5340976Z Artifact download completed successfully.
2025-05-07T20:22:57.5341308Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:57.5346128Z Download artifact has finished successfully
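
download-artifact validates the blob against the expected digest shown above. The same check can be repeated by hand; a sketch, assuming the downloaded archive was saved locally as artifact.zip (hypothetical filename):

  sha256sum artifact.zip
  # expected: b7fa57ec448da168df38f5dcb4b2b6b212acdb815f6a27a8982ddcd3bf673086
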
2025-05-07T20:22:57.5605930Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:57.5606312Z with:
2025-05-07T20:22:57.5606529Z   driver-version: 570.133.07
2025-05-07T20:22:57.5606794Z env:
2025-05-07T20:22:57.5607010Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.5607312Z   BUILD_ENV: build_binary
2025-05-07T20:22:57.5607565Z   BUILD_TARGET: genai
2025-05-07T20:22:57.5607790Z   BUILD_VARIANT: cuda
2025-05-07T20:22:57.5608035Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:57.5608294Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.5608524Z ##[endgroup]
2025-05-07T20:22:57.5700611Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:57.5700995Z with:
2025-05-07T20:22:57.5701191Z   timeout_minutes: 10
2025-05-07T20:22:57.5701605Z   max_attempts: 3
2025-05-07T20:22:57.5724856Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }
if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then RESET_GPU=1 fi fi if [ "$RESET_GPU" -eq 1 ]; then NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1) # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388 for PCI_ID in $NVIDIA_DEVICES; do DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable) echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)" # This requires sudo permission of course echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset sleep 1 done fi sudo rm -fv /tmp/nvidia_driver set -e fi ) } post_install_nvidia_driver_common() { ( sudo modprobe nvidia || true echo "After installing NVIDIA driver" lspci lsmod modinfo nvidia || true ( set +e nvidia-smi # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in # the case where the driver has already crashed as it still can get the driver version # and some basic information like the bus ID. However, the rest of the information # would be missing (ERR!), for example: # # +-----------------------------------------------------------------------------+ # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 | # |-------------------------------+----------------------+----------------------+ # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | # | | | MIG M. | # |===============================+======================+======================| # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! | # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default | # | | | ERR! | # +-------------------------------+----------------------+----------------------+ # # +-----------------------------------------------------------------------------+ # | Processes: | # | GPU GI CI PID Type Process name GPU Memory | # | ID ID Usage | # |=============================================================================| # +-----------------------------------------------------------------------------+ # # This should be reported as a failure instead as it will guarantee to fail when # Docker tries to run with --gpus all # # So, the correct check here is to query one of the missing piece of info like # GPU name, so that the command can fail accordingly nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 NVIDIA_SMI_STATUS=$? 
# Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285 if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}" else echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}" exit ${NVIDIA_SMI_STATUS} fi set -e ) ) } install_nvidia_driver_amzn2() { ( set -x pre_install_nvidia_driver_amzn2 install_nvidia_driver_common post_install_nvidia_driver_common ) } install_nvidia_driver_ubuntu20() { ( set -x install_nvidia_driver_common post_install_nvidia_driver_common ) } echo "== Installing nvidia driver ${DRIVER_FN} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_driver_amzn2 ;; ubuntu20.04) install_nvidia_driver_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac # Install container toolkit based on distribution echo "== Installing nvidia container toolkit for ${DISTRIBUTION} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_docker2_amzn2 ;; ubuntu20.04) install_nvidia_docker2_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}" # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with # more than one GPUs. This just needs to be run once. The command fails # on subsequent runs and complains that the mode is already on, but that's # ok sudo nvidia-persistenced || true # This should show persistence mode ON nvidia-smi 2025-05-07T20:22:57.5748113Z retry_wait_seconds: 10 2025-05-07T20:22:57.5748392Z polling_interval_seconds: 1 2025-05-07T20:22:57.5748660Z warning_on_retry: true 2025-05-07T20:22:57.5748908Z continue_on_error: false 2025-05-07T20:22:57.5749147Z env: 2025-05-07T20:22:57.5749363Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:22:57.5749713Z BUILD_ENV: build_binary 2025-05-07T20:22:57.5749949Z BUILD_TARGET: genai 2025-05-07T20:22:57.5750170Z BUILD_VARIANT: cuda 2025-05-07T20:22:57.5750408Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:22:57.5750663Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:22:57.5750906Z DRIVER_VERSION: 570.133.07 2025-05-07T20:22:57.5751150Z ##[endgroup] 2025-05-07T20:22:57.6558649Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run == 2025-05-07T20:22:57.6560240Z + pre_install_nvidia_driver_amzn2 2025-05-07T20:22:57.6560646Z + sudo yum remove -y nvidia-driver-latest-dkms 2025-05-07T20:22:58.2912457Z No match for argument: nvidia-driver-latest-dkms 2025-05-07T20:22:58.2913312Z No packages marked for removal. 2025-05-07T20:22:58.2978141Z Dependencies resolved. 2025-05-07T20:22:58.2988966Z Nothing to do. 2025-05-07T20:22:58.2990560Z Complete! 2025-05-07T20:22:58.3314640Z + install_nvidia_driver_common 2025-05-07T20:22:58.3318391Z + echo 'Before installing NVIDIA driver' 2025-05-07T20:22:58.3318714Z + lspci 2025-05-07T20:22:58.3320393Z Before installing NVIDIA driver 2025-05-07T20:22:58.3506930Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:22:58.3507689Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:22:58.3508249Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:22:58.3508765Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2025-05-07T20:22:58.3509229Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller 2025-05-07T20:22:58.3509819Z 00:05.0 Ethernet controller: Amazon.com, Inc. 
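
nick-fields/retry wraps the inline script with the budget configured above. A rough plain-bash equivalent of those settings, not the action's actual implementation (10-minute timeout per attempt, 3 attempts, 10 s between retries; setup_nvidia.sh is a hypothetical file holding the command above):

  for attempt in 1 2 3; do
    timeout 600 bash setup_nvidia.sh && break       # 10-minute cap per attempt
    [ "$attempt" -eq 3 ] && exit 1                  # give up after max_attempts
    echo "WARNING: attempt ${attempt} failed, retrying in 10s"
    sleep 10
  done
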
2025-05-07T20:22:57.6558649Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:57.6560240Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:57.6560646Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:58.2912457Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:58.2913312Z No packages marked for removal.
2025-05-07T20:22:58.2978141Z Dependencies resolved.
2025-05-07T20:22:58.2988966Z Nothing to do.
2025-05-07T20:22:58.2990560Z Complete!
2025-05-07T20:22:58.3314640Z + install_nvidia_driver_common
2025-05-07T20:22:58.3318391Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:58.3318714Z + lspci
2025-05-07T20:22:58.3320393Z Before installing NVIDIA driver
2025-05-07T20:22:58.3506930Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:58.3507689Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:58.3508249Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:58.3508765Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:58.3509229Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:58.3509819Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:58.3510300Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:58.3510776Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:58.3511163Z + lsmod
2025-05-07T20:22:58.3551101Z Module Size Used by
2025-05-07T20:22:58.3551399Z xt_conntrack 16384 1
2025-05-07T20:22:58.3551674Z nft_chain_nat 16384 3
2025-05-07T20:22:58.3551943Z xt_MASQUERADE 20480 1
2025-05-07T20:22:58.3552249Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:58.3552584Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:58.3552977Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:58.3553413Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:58.3553738Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:58.3554027Z xfrm_user 57344 1
2025-05-07T20:22:58.3554299Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:58.3554588Z xt_addrtype 16384 2
2025-05-07T20:22:58.3554874Z nft_compat 20480 4
2025-05-07T20:22:58.3555174Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:58.3555584Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:58.3555966Z br_netfilter 36864 0
2025-05-07T20:22:58.3556238Z bridge 323584 1 br_netfilter
2025-05-07T20:22:58.3556540Z stp 16384 1 bridge
2025-05-07T20:22:58.3556825Z llc 16384 2 bridge,stp
2025-05-07T20:22:58.3557103Z overlay 167936 0
2025-05-07T20:22:58.3557355Z tls 135168 0
2025-05-07T20:22:58.3557613Z nls_ascii 16384 1
2025-05-07T20:22:58.3557905Z nls_cp437 20480 1
2025-05-07T20:22:58.3558156Z vfat 24576 1
2025-05-07T20:22:58.3558414Z fat 86016 1 vfat
2025-05-07T20:22:58.3558688Z sunrpc 696320 1
2025-05-07T20:22:58.3558935Z ena 180224 0
2025-05-07T20:22:58.3559184Z i8042 45056 0
2025-05-07T20:22:58.3559439Z serio 28672 3 i8042
2025-05-07T20:22:58.3559710Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:58.3559985Z button 24576 0
2025-05-07T20:22:58.3560238Z sch_fq_codel 20480 17
2025-05-07T20:22:58.3560488Z fuse 163840 1
2025-05-07T20:22:58.3560734Z dm_mod 188416 0
2025-05-07T20:22:58.3560992Z configfs 57344 1
2025-05-07T20:22:58.3561239Z dax 45056 1 dm_mod
2025-05-07T20:22:58.3561509Z loop 36864 0
2025-05-07T20:22:58.3561757Z dmi_sysfs 20480 0
2025-05-07T20:22:58.3561997Z crc32_pclmul 16384 0
2025-05-07T20:22:58.3562254Z crc32c_intel 24576 0
2025-05-07T20:22:58.3562503Z efivarfs 24576 1
2025-05-07T20:22:58.3562742Z + modinfo nvidia
2025-05-07T20:22:58.3570136Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:58.3570612Z import_ns: DMA_BUF
2025-05-07T20:22:58.3570860Z alias: char-major-195-*
2025-05-07T20:22:58.3571119Z version: 570.133.07
2025-05-07T20:22:58.3571372Z supported: external
2025-05-07T20:22:58.3571751Z license: Dual MIT/GPL
2025-05-07T20:22:58.3572075Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:58.3572410Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:58.3572847Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:58.3573172Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:58.3573514Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:58.3573843Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:58.3574155Z depends: i2c-core,drm
2025-05-07T20:22:58.3574417Z retpoline: Y
2025-05-07T20:22:58.3574629Z name: nvidia
2025-05-07T20:22:58.3574992Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:58.3575468Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:58.3575901Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:58.3576402Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:58.3576712Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:58.3577027Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:58.3577343Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:58.3577645Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:58.3577947Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:58.3578302Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:58.3578687Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:58.3579018Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:58.3579312Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:58.3579622Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:58.3579989Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:58.3580385Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:58.3580759Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:58.3581178Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:58.3581584Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:58.3582005Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:58.3582419Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:58.3582758Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:58.3583123Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:58.3583497Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:58.3583838Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:58.3584168Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:58.3584499Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:58.3584822Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:58.3585136Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:58.3585478Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:58.3585842Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:58.3586171Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:58.3586502Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:58.3586852Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:58.3587191Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:58.3587527Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:58.3587858Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:58.3588151Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:58.3588475Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:58.3588796Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:58.3589113Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:58.3589442Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:58.3589850Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:58.3590201Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:58.3590539Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:58.3590879Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:58.3591219Z parm: rm_firmware_active:charp
2025-05-07T20:22:58.3591608Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:58.3591853Z ++ command -v nvidia-smi
2025-05-07T20:22:58.3592103Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:58.3592364Z + set +e
2025-05-07T20:22:58.3592674Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:00.1692137Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:00.1692512Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:00.1692766Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:00.1692987Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:00.1693246Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:00.1693672Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:00.1694137Z + set -e
2025-05-07T20:23:00.1694941Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:00.1695328Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
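
Since a matching driver was already present, the install path was skipped entirely. The decisive check, runnable by hand (expected value taken from this log):

  modinfo -F version nvidia                                           # kernel module: 570.133.07
  nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0  # userspace: must equal $DRIVER_VERSION, or the installer runs
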
2025-05-07T20:23:00.1695784Z + post_install_nvidia_driver_common
2025-05-07T20:23:00.1698111Z + sudo modprobe nvidia
2025-05-07T20:23:00.2695862Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:00.2696204Z + lspci
2025-05-07T20:23:00.2696428Z After installing NVIDIA driver
2025-05-07T20:23:00.2812019Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:00.2812525Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:00.2813075Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:00.2813601Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:00.2814072Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:00.2814595Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:00.2815103Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:00.2815570Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:00.2815976Z + lsmod
2025-05-07T20:23:00.2845682Z Module Size Used by
2025-05-07T20:23:00.2846014Z nvidia_uvm 1884160 0
2025-05-07T20:23:00.2846280Z nvidia 11583488 1 nvidia_uvm
2025-05-07T20:23:00.2846568Z drm 602112 1 nvidia
2025-05-07T20:23:00.2846873Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:23:00.2847177Z backlight 24576 1 drm
2025-05-07T20:23:00.2847463Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:23:00.2847751Z xt_conntrack 16384 1
2025-05-07T20:23:00.2848009Z nft_chain_nat 16384 3
2025-05-07T20:23:00.2848271Z xt_MASQUERADE 20480 1
2025-05-07T20:23:00.2848625Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:00.2848964Z nf_conntrack_netlink 57344 0
2025-05-07T20:23:00.2849356Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:00.2849796Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:23:00.2850123Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:23:00.2850410Z xfrm_user 57344 1
2025-05-07T20:23:00.2850681Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:23:00.2850981Z xt_addrtype 16384 2
2025-05-07T20:23:00.2851243Z nft_compat 20480 4
2025-05-07T20:23:00.2851545Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:23:00.2851960Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:00.2852337Z br_netfilter 36864 0
2025-05-07T20:23:00.2852625Z bridge 323584 1 br_netfilter
2025-05-07T20:23:00.2852916Z stp 16384 1 bridge
2025-05-07T20:23:00.2853200Z llc 16384 2 bridge,stp
2025-05-07T20:23:00.2853489Z overlay 167936 0
2025-05-07T20:23:00.2853736Z tls 135168 0
2025-05-07T20:23:00.2853985Z nls_ascii 16384 1
2025-05-07T20:23:00.2854488Z nls_cp437 20480 1
2025-05-07T20:23:00.2854743Z vfat 24576 1
2025-05-07T20:23:00.2854993Z fat 86016 1 vfat
2025-05-07T20:23:00.2855259Z sunrpc 696320 1
2025-05-07T20:23:00.2855506Z ena 180224 0
2025-05-07T20:23:00.2855754Z i8042 45056 0
2025-05-07T20:23:00.2856004Z serio 28672 3 i8042
2025-05-07T20:23:00.2856274Z ghash_clmulni_intel 16384 0
2025-05-07T20:23:00.2856536Z button 24576 0
2025-05-07T20:23:00.2856791Z sch_fq_codel 20480 17
2025-05-07T20:23:00.2857049Z fuse 163840 1
2025-05-07T20:23:00.2857293Z dm_mod 188416 0
2025-05-07T20:23:00.2857542Z configfs 57344 1
2025-05-07T20:23:00.2857972Z dax 45056 1 dm_mod
2025-05-07T20:23:00.2858253Z loop 36864 0
2025-05-07T20:23:00.2858506Z dmi_sysfs 20480 0
2025-05-07T20:23:00.2858759Z crc32_pclmul 16384 0
2025-05-07T20:23:00.2859012Z crc32c_intel 24576 0
2025-05-07T20:23:00.2859262Z efivarfs 24576 1
2025-05-07T20:23:00.2859508Z + modinfo nvidia
2025-05-07T20:23:00.2862468Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:00.2862937Z import_ns: DMA_BUF
2025-05-07T20:23:00.2863189Z alias: char-major-195-*
2025-05-07T20:23:00.2863454Z version: 570.133.07
2025-05-07T20:23:00.2863700Z supported: external
2025-05-07T20:23:00.2863951Z license: Dual MIT/GPL
2025-05-07T20:23:00.2864244Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:00.2864577Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:00.2864898Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:00.2865220Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:00.2865562Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:00.2865893Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:00.2866208Z depends: i2c-core,drm
2025-05-07T20:23:00.2866469Z retpoline: Y
2025-05-07T20:23:00.2866679Z name: nvidia
2025-05-07T20:23:00.2867037Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:00.2867506Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:00.2867954Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:00.2868408Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:00.2868714Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:00.2869015Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:00.2869321Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:00.2869707Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:00.2870016Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:00.2870369Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:00.2870757Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:00.2871087Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:00.2871380Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:00.2871686Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:00.2872043Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:00.2872427Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:00.2872802Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:00.2873216Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:00.2873621Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:00.2874029Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:00.2874433Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:00.2874773Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:00.2875132Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:00.2875607Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:00.2875949Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:00.2876265Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:00.2876598Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:00.2876920Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:00.2877229Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:00.2877568Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:00.2877923Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:00.2878251Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:00.2878631Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:00.2878972Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:00.2879390Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:00.2879720Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:00.2880047Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:00.2880339Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:00.2880662Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:00.2880977Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:00.2881288Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:00.2881615Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:00.2881967Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:00.2882376Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:00.2882696Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:00.2883041Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:00.2883383Z parm: rm_firmware_active:charp
2025-05-07T20:23:00.2883661Z + set +e
2025-05-07T20:23:00.2883854Z + nvidia-smi
2025-05-07T20:23:01.6955431Z Wed May 7 20:23:01 2025
2025-05-07T20:23:01.6955815Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.6956402Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:01.6956969Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.6957452Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:01.6957973Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:01.6958400Z |                                         |                        |               MIG M. |
2025-05-07T20:23:01.6958726Z |=========================================+========================+======================|
2025-05-07T20:23:01.7020415Z |   0  NVIDIA A10G                   Off  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:01.7021630Z |  0%   29C    P0             62W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:01.7022482Z |                                         |                        |                  N/A |
2025-05-07T20:23:01.7023350Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.7024110Z
2025-05-07T20:23:01.7024865Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.7025694Z | Processes:                                                                              |
2025-05-07T20:23:01.7026542Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:01.7027649Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:01.7028196Z |=========================================================================================|
2025-05-07T20:23:01.7028665Z |  No running processes found                                                             |
2025-05-07T20:23:01.7029367Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:02.1156283Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:03.5225822Z NVIDIA A10G
2025-05-07T20:23:03.7918963Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:03.7919251Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:03.7919590Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:03.7919909Z + set -e
2025-05-07T20:23:03.7920116Z INFO: Ignoring allowed status 0
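
The health gate above deliberately queries gpu_name rather than trusting the exit code of plain nvidia-smi, which can return 0 even after a driver crash (the ERR! case documented in the script's comments). Distilled from the script, with the tolerated status codes:

  nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
  NVIDIA_SMI_STATUS=$?
  # 0 is healthy; 14 is tolerated per https://github.com/NVIDIA/gpu-operator/issues/285
  if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
    exit "$NVIDIA_SMI_STATUS"
  fi
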
2025-05-07T20:23:03.7928332Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:03.7931444Z + sudo yum install -y yum-utils
2025-05-07T20:23:04.1903209Z Last metadata expiration check: 0:07:01 ago on Wed May 7 20:16:03 2025.
2025-05-07T20:23:04.2152286Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:04.2546013Z Dependencies resolved.
2025-05-07T20:23:04.2728531Z Nothing to do.
2025-05-07T20:23:04.2728881Z Complete!
2025-05-07T20:23:04.3114697Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:04.3115319Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.3116179Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.6623250Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.7179635Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:05.3279446Z nvidia-container-toolkit                         14 kB/s | 833 B  00:00
2025-05-07T20:23:05.3525216Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:05.3927019Z Dependencies resolved.
2025-05-07T20:23:05.4104638Z ================================================================================
2025-05-07T20:23:05.4105049Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:05.4105441Z ================================================================================
2025-05-07T20:23:05.4105747Z Downgrading:
2025-05-07T20:23:05.4106109Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit  1.2 M
2025-05-07T20:23:05.4106699Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit  5.6 M
2025-05-07T20:23:05.4107050Z
2025-05-07T20:23:05.4107138Z Transaction Summary
2025-05-07T20:23:05.4107387Z ================================================================================
2025-05-07T20:23:05.4107694Z Downgrade  2 Packages
2025-05-07T20:23:05.4107839Z
2025-05-07T20:23:05.4107960Z Total download size: 6.8 M
2025-05-07T20:23:05.4109868Z Downloading Packages:
2025-05-07T20:23:05.4588880Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  26 MB/s | 1.2 MB  00:00
2025-05-07T20:23:05.5291708Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  48 MB/s | 5.6 MB  00:00
2025-05-07T20:23:05.5299987Z --------------------------------------------------------------------------------
2025-05-07T20:23:05.5302954Z Total                                            57 MB/s | 6.8 MB  00:00
2025-05-07T20:23:05.5305967Z Running transaction check
2025-05-07T20:23:05.5407832Z Transaction check succeeded.
2025-05-07T20:23:05.5408110Z Running transaction test
2025-05-07T20:23:05.5705489Z Transaction test succeeded.
2025-05-07T20:23:05.5706989Z Running transaction
2025-05-07T20:23:06.1210901Z   Preparing        :                                                        1/1
2025-05-07T20:23:06.2279722Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:06.2317789Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:06.2518721Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:06.2519300Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:06.2633999Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:06.2671185Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:07.6782672Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:07.6783258Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:07.6783775Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:07.6784307Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:07.8147363Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
================================================================================
2025-05-07T20:23:07.8148253Z WARNING:
2025-05-07T20:23:07.8148489Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:07.8148727Z
2025-05-07T20:23:07.8148817Z   Available Versions:
2025-05-07T20:23:07.8148969Z
2025-05-07T20:23:07.8149071Z   Version 2023.7.20250331:
2025-05-07T20:23:07.8149397Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:07.8149762Z
2025-05-07T20:23:07.8149893Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:07.8150097Z
2025-05-07T20:23:07.8150176Z     Release notes:
2025-05-07T20:23:07.8150584Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:07.8150949Z
2025-05-07T20:23:07.8151040Z   Version 2023.7.20250414:
2025-05-07T20:23:07.8151336Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:07.8151590Z
2025-05-07T20:23:07.8151702Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:07.8151907Z
2025-05-07T20:23:07.8151991Z     Release notes:
2025-05-07T20:23:07.8152384Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:07.8152744Z
2025-05-07T20:23:07.8152829Z   Version 2023.7.20250428:
2025-05-07T20:23:07.8153134Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:07.8153384Z
2025-05-07T20:23:07.8153494Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:07.8153695Z
2025-05-07T20:23:07.8153801Z     Release notes:
2025-05-07T20:23:07.8164917Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:07.8165299Z
2025-05-07T20:23:07.8165418Z ================================================================================
2025-05-07T20:23:07.8499196Z
2025-05-07T20:23:07.8499464Z
2025-05-07T20:23:07.8499551Z Downgraded:
2025-05-07T20:23:07.8499917Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:07.8500484Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:07.8500853Z
2025-05-07T20:23:07.8500939Z Complete!
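
Note that pinning nvidia-container-toolkit-1.16.2 resolved as a downgrade: the AMI shipped with 1.17.6 preinstalled, so yum replaced both the toolkit and its base package. The pin as a standalone command (verbatim from the script above):

  sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2   # resolves as a downgrade when 1.17.6 is already present
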
2025-05-07T20:23:07.8965599Z + sudo systemctl restart docker
2025-05-07T20:23:12.1351880Z Wed May 7 20:23:12 2025
2025-05-07T20:23:12.1352686Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:12.1353687Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:12.1354645Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:12.1355618Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:12.1356658Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:12.1357506Z |                                         |                        |               MIG M. |
2025-05-07T20:23:12.1358169Z |=========================================+========================+======================|
2025-05-07T20:23:12.1433958Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:12.1434724Z |  0%   29C    P0             62W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:12.1435111Z |                                         |                        |                  N/A |
2025-05-07T20:23:12.1435511Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:12.1435903Z
2025-05-07T20:23:12.1436296Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:12.1436726Z | Processes:                                                                              |
2025-05-07T20:23:12.1437169Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:12.1437720Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:12.1438071Z |=========================================================================================|
2025-05-07T20:23:12.1438719Z |  No running processes found                                                             |
2025-05-07T20:23:12.2898970Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:12.6307371Z Command completed after 1 attempt(s).
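
Persistence-M reads On here, versus Off in the earlier snapshot, because the script ran sudo nvidia-persistenced after the toolkit install. The GPU_FLAG it exported for later steps expands to a docker invocation along these lines (the image name is a placeholder):

  docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all <image> nvidia-smi
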
2025-05-07T20:23:12.6393850Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:12.6394340Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:12.6409500Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:12.6409856Z env:
2025-05-07T20:23:12.6410076Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:12.6410390Z   BUILD_ENV: build_binary
2025-05-07T20:23:12.6410642Z   BUILD_TARGET: genai
2025-05-07T20:23:12.6410889Z   BUILD_VARIANT: cuda
2025-05-07T20:23:12.6411132Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:12.6411400Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:12.6411713Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.6412041Z ##[endgroup]
2025-05-07T20:23:12.9760133Z ################################################################################
2025-05-07T20:23:12.9760493Z # Print System Info
2025-05-07T20:23:12.9760711Z #
2025-05-07T20:23:12.9776808Z # [2025-05-07T20:23:12.977Z] + print_system_info
2025-05-07T20:23:12.9777272Z ################################################################################
2025-05-07T20:23:12.9777495Z
2025-05-07T20:23:12.9777606Z ################################################################################
2025-05-07T20:23:12.9777934Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.9778222Z + printenv
2025-05-07T20:23:12.9778346Z
2025-05-07T20:23:12.9800732Z SHELL=/bin/bash
2025-05-07T20:23:12.9801250Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.9801796Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.9802464Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_6c888332-cb40-41f2-a59e-2fe3ef0a577a
2025-05-07T20:23:12.9803267Z GITHUB_ACTION=__run
2025-05-07T20:23:12.9803656Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.9804316Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.9804638Z RUNNER_NAME=i-00cb9561c833cfdb2
2025-05-07T20:23:12.9804925Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.9805227Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.9805478Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.9805847Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.9806267Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.9806535Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.9806832Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.9807478Z ***
2025-05-07T20:23:12.9807690Z LOGNAME=ec2-user
2025-05-07T20:23:12.9807921Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.9808190Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.9808422Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.9808637Z SYSTEMD_EXEC_PID=55529
2025-05-07T20:23:12.9808918Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.9809461Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.9809962Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.9810244Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.9810504Z RUNNER_OS=Linux
2025-05-07T20:23:12.9810744Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.9811019Z HOME=/home/ec2-user
2025-05-07T20:23:12.9811273Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.9811559Z LANG=C.UTF-8
2025-05-07T20:23:12.9811850Z RUNNER_TRACKING_ID=github_861545f2-750e-499f-bdda-e801da2ef5a8
2025-05-07T20:23:12.9812201Z RUNNER_ARCH=X64
2025-05-07T20:23:12.9812477Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.9813106Z BUILD_TARGET=genai
2025-05-07T20:23:12.9813632Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_6c888332-cb40-41f2-a59e-2fe3ef0a577a
2025-05-07T20:23:12.9814491Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_6c888332-cb40-41f2-a59e-2fe3ef0a577a
2025-05-07T20:23:12.9815210Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.9815874Z INVOCATION_ID=7ee3562b3fc14d84a45c0646162e5533
2025-05-07T20:23:12.9816194Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.9816457Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.9817021Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_6c888332-cb40-41f2-a59e-2fe3ef0a577a
2025-05-07T20:23:12.9817627Z BUILD_ENV=build_binary
2025-05-07T20:23:12.9817855Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.9818063Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.9818286Z KERN_NAME_LC=linux
2025-05-07T20:23:12.9818514Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:12.9818810Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.9819150Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.9819454Z USER=ec2-user
2025-05-07T20:23:12.9819766Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.9820147Z SHLVL=1
2025-05-07T20:23:12.9820412Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:12.9820840Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:12.9821427Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:12.9821780Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:12.9822017Z KERN_NAME=Linux
2025-05-07T20:23:12.9822231Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:12.9822774Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:12.9823344Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:12.9823718Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:12.9823952Z JOURNAL_STREAM=8:91669
2025-05-07T20:23:12.9824264Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:12.9824622Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:12.9824922Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:12.9825253Z GITHUB_BASE_REF=main
2025-05-07T20:23:12.9825470Z CI=true
2025-05-07T20:23:12.9825666Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:12.9825954Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:12.9826232Z GITHUB_ACTION_REF=
2025-05-07T20:23:12.9826469Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:12.9827073Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_6c888332-cb40-41f2-a59e-2fe3ef0a577a
2025-05-07T20:23:12.9827654Z MACHINE_NAME=x86_64
2025-05-07T20:23:12.9827874Z _=/usr/bin/printenv
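
GITHUB_ENV, GITHUB_OUTPUT, GITHUB_PATH, and GITHUB_STEP_SUMMARY above are file-command paths: a step hands values to later steps by appending to them. That is exactly how GPU_FLAG entered this environment (line verbatim from the setup script earlier in this log):

  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
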
2025-05-07T20:23:12.9820147Z SHLVL=1 2025-05-07T20:23:12.9820412Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:12.9820840Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:12.9821427Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:12.9821780Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:12.9822017Z KERN_NAME=Linux 2025-05-07T20:23:12.9822231Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:12.9822774Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:12.9823344Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:12.9823718Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:12.9823952Z JOURNAL_STREAM=8:91669 2025-05-07T20:23:12.9824264Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:12.9824622Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:12.9824922Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:12.9825253Z GITHUB_BASE_REF=main 2025-05-07T20:23:12.9825470Z CI=true 2025-05-07T20:23:12.9825666Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:12.9825954Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:12.9826232Z GITHUB_ACTION_REF= 2025-05-07T20:23:12.9826469Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:12.9827073Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_6c888332-cb40-41f2-a59e-2fe3ef0a577a 2025-05-07T20:23:12.9827654Z MACHINE_NAME=x86_64 2025-05-07T20:23:12.9827874Z _=/usr/bin/printenv 2025-05-07T20:23:12.9828004Z 2025-05-07T20:23:12.9828117Z ################################################################################ 2025-05-07T20:23:12.9828433Z [INFO] Print ldd version ... 2025-05-07T20:23:12.9828696Z + ldd --version 2025-05-07T20:23:12.9828823Z 2025-05-07T20:23:12.9828906Z ldd (GNU libc) 2.34 2025-05-07T20:23:12.9829186Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:12.9829697Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:12.9830221Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:12.9830658Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:12.9830905Z 2025-05-07T20:23:12.9831037Z ################################################################################ 2025-05-07T20:23:12.9831338Z [INFO] Print CPU info ... 
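Each command in these diagnostics is echoed as "+ <command>" before its output ("+ printenv" and "+ ldd --version" above; "+ nproc" and "+ lscpu" below). A plausible print-and-run helper that would produce this trace, sketched as an assumption rather than copied from setup_env.bash:

# Hypothetical echo-then-execute helper (assumed, not the real one).
print_exec () {
  echo "+ $*"
  echo ""
  "$@"
  local retcode=$?
  echo ""
  return "$retcode"
}

Calling print_exec nproc, for example, would yield the "+ nproc" line followed by its output, as seen below.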
2025-05-07T20:23:12.9831560Z + nproc 2025-05-07T20:23:12.9831673Z 2025-05-07T20:23:12.9845940Z 16 2025-05-07T20:23:12.9847697Z 2025-05-07T20:23:12.9848037Z + lscpu 2025-05-07T20:23:12.9848195Z 2025-05-07T20:23:12.9959186Z Architecture: x86_64 2025-05-07T20:23:12.9959681Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.9960499Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.9961021Z Byte Order: Little Endian 2025-05-07T20:23:12.9961458Z CPU(s): 16 2025-05-07T20:23:12.9961843Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.9962218Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.9962548Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.9962851Z CPU family: 23 2025-05-07T20:23:12.9963452Z Model: 49 2025-05-07T20:23:12.9963741Z Thread(s) per core: 2 2025-05-07T20:23:12.9964016Z Core(s) per socket: 8 2025-05-07T20:23:12.9964292Z Socket(s): 1 2025-05-07T20:23:12.9964560Z Stepping: 0 2025-05-07T20:23:12.9964852Z BogoMIPS: 5600.00 2025-05-07T20:23:12.9966917Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.9969073Z Hypervisor vendor: KVM 2025-05-07T20:23:12.9969368Z Virtualization type: full 2025-05-07T20:23:12.9969699Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.9970054Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.9970396Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.9970739Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.9971076Z NUMA node(s): 1 2025-05-07T20:23:12.9971385Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.9971710Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.9972114Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.9972639Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.9973116Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.9973599Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.9974085Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.9974579Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.9975308Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.9976098Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.9976708Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.9977561Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.9978413Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.9979083Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.9979444Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.9979667Z 2025-05-07T20:23:12.9979837Z + cat /proc/cpuinfo 2025-05-07T20:23:12.9979989Z 2025-05-07T20:23:12.9980074Z processor : 0 2025-05-07T20:23:12.9980303Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.9980555Z cpu family : 23 2025-05-07T20:23:12.9980774Z model : 49 
2025-05-07T20:23:12.9980996Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.9981246Z stepping : 0 2025-05-07T20:23:12.9981471Z microcode : 0x830107f 2025-05-07T20:23:12.9981802Z cpu MHz : 3333.146 2025-05-07T20:23:12.9982013Z cache size : 512 KB 2025-05-07T20:23:12.9982216Z physical id : 0 2025-05-07T20:23:12.9982417Z siblings : 16 2025-05-07T20:23:12.9982614Z core id : 0 2025-05-07T20:23:12.9982802Z cpu cores : 8 2025-05-07T20:23:12.9982994Z apicid : 0 2025-05-07T20:23:12.9983183Z initial apicid : 0 2025-05-07T20:23:12.9983383Z fpu : yes 2025-05-07T20:23:12.9983572Z fpu_exception : yes 2025-05-07T20:23:12.9983780Z cpuid level : 13 2025-05-07T20:23:12.9983973Z wp : yes 2025-05-07T20:23:12.9986003Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.9988214Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.9988696Z bogomips : 5600.00 2025-05-07T20:23:12.9988905Z TLB size : 3072 4K pages 2025-05-07T20:23:12.9989132Z clflush size : 64 2025-05-07T20:23:12.9989338Z cache_alignment : 64 2025-05-07T20:23:12.9989666Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.9989986Z power management: 2025-05-07T20:23:12.9990127Z 2025-05-07T20:23:12.9990205Z processor : 1 2025-05-07T20:23:12.9990417Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.9990661Z cpu family : 23 2025-05-07T20:23:12.9990895Z model : 49 2025-05-07T20:23:12.9991093Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.9991324Z stepping : 0 2025-05-07T20:23:12.9991530Z microcode : 0x830107f 2025-05-07T20:23:12.9991748Z cpu MHz : 3305.353 2025-05-07T20:23:12.9991951Z cache size : 512 KB 2025-05-07T20:23:12.9992169Z physical id : 0 2025-05-07T20:23:12.9992378Z siblings : 16 2025-05-07T20:23:12.9992566Z core id : 1 2025-05-07T20:23:12.9992762Z cpu cores : 8 2025-05-07T20:23:12.9992956Z apicid : 2 2025-05-07T20:23:12.9993139Z initial apicid : 2 2025-05-07T20:23:12.9993350Z fpu : yes 2025-05-07T20:23:12.9993541Z fpu_exception : yes 2025-05-07T20:23:12.9993749Z cpuid level : 13 2025-05-07T20:23:12.9993958Z wp : yes 2025-05-07T20:23:12.9995915Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.9998157Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.9998665Z bogomips : 5600.00 2025-05-07T20:23:12.9998929Z TLB size : 3072 4K pages 2025-05-07T20:23:12.9999216Z clflush size : 64 
2025-05-07T20:23:12.9999446Z cache_alignment : 64 2025-05-07T20:23:12.9999701Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0000011Z power management: 2025-05-07T20:23:13.0000140Z 2025-05-07T20:23:13.0000235Z processor : 2 2025-05-07T20:23:13.0000438Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0000667Z cpu family : 23 2025-05-07T20:23:13.0000868Z model : 49 2025-05-07T20:23:13.0001086Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0001350Z stepping : 0 2025-05-07T20:23:13.0001555Z microcode : 0x830107f 2025-05-07T20:23:13.0001769Z cpu MHz : 3284.425 2025-05-07T20:23:13.0001973Z cache size : 512 KB 2025-05-07T20:23:13.0002179Z physical id : 0 2025-05-07T20:23:13.0002373Z siblings : 16 2025-05-07T20:23:13.0002668Z core id : 2 2025-05-07T20:23:13.0002858Z cpu cores : 8 2025-05-07T20:23:13.0003049Z apicid : 4 2025-05-07T20:23:13.0003238Z initial apicid : 4 2025-05-07T20:23:13.0003443Z fpu : yes 2025-05-07T20:23:13.0003629Z fpu_exception : yes 2025-05-07T20:23:13.0004211Z cpuid level : 13 2025-05-07T20:23:13.0004415Z wp : yes 2025-05-07T20:23:13.0006509Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0008740Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0009221Z bogomips : 5600.00 2025-05-07T20:23:13.0009436Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0009672Z clflush size : 64 2025-05-07T20:23:13.0009878Z cache_alignment : 64 2025-05-07T20:23:13.0010143Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0010455Z power management: 2025-05-07T20:23:13.0010585Z 2025-05-07T20:23:13.0010668Z processor : 3 2025-05-07T20:23:13.0010915Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0011173Z cpu family : 23 2025-05-07T20:23:13.0011362Z model : 49 2025-05-07T20:23:13.0011562Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0011797Z stepping : 0 2025-05-07T20:23:13.0012004Z microcode : 0x830107f 2025-05-07T20:23:13.0012217Z cpu MHz : 3301.887 2025-05-07T20:23:13.0012434Z cache size : 512 KB 2025-05-07T20:23:13.0012646Z physical id : 0 2025-05-07T20:23:13.0012842Z siblings : 16 2025-05-07T20:23:13.0013031Z core id : 3 2025-05-07T20:23:13.0013224Z cpu cores : 8 2025-05-07T20:23:13.0013411Z apicid : 6 2025-05-07T20:23:13.0013598Z initial apicid : 6 2025-05-07T20:23:13.0013802Z fpu : yes 2025-05-07T20:23:13.0013986Z fpu_exception : yes 2025-05-07T20:23:13.0014200Z cpuid level : 13 2025-05-07T20:23:13.0014400Z wp : yes 2025-05-07T20:23:13.0016336Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0018560Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0019037Z bogomips : 5600.00 2025-05-07T20:23:13.0019250Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0019475Z clflush size : 64 2025-05-07T20:23:13.0019682Z cache_alignment : 64 2025-05-07T20:23:13.0019945Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0020248Z power management: 2025-05-07T20:23:13.0020376Z 2025-05-07T20:23:13.0020457Z processor : 4 2025-05-07T20:23:13.0020678Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0020942Z cpu family : 23 2025-05-07T20:23:13.0021138Z model : 49 2025-05-07T20:23:13.0021338Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0021574Z stepping : 0 2025-05-07T20:23:13.0021777Z microcode : 0x830107f 2025-05-07T20:23:13.0021986Z cpu MHz : 3280.572 2025-05-07T20:23:13.0022193Z cache size : 512 KB 2025-05-07T20:23:13.0022403Z physical id : 0 2025-05-07T20:23:13.0022601Z siblings : 16 2025-05-07T20:23:13.0022792Z core id : 4 2025-05-07T20:23:13.0022981Z cpu cores : 8 2025-05-07T20:23:13.0023189Z apicid : 8 2025-05-07T20:23:13.0023501Z initial apicid : 8 2025-05-07T20:23:13.0033522Z fpu : yes 2025-05-07T20:23:13.0033799Z fpu_exception : yes 2025-05-07T20:23:13.0034022Z cpuid level : 13 2025-05-07T20:23:13.0034234Z wp : yes 2025-05-07T20:23:13.0036310Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0038572Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0039053Z bogomips : 5600.00 2025-05-07T20:23:13.0039283Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0039530Z clflush size : 64 2025-05-07T20:23:13.0039743Z cache_alignment : 64 2025-05-07T20:23:13.0040018Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0040337Z power management: 2025-05-07T20:23:13.0040472Z 2025-05-07T20:23:13.0040564Z processor : 5 2025-05-07T20:23:13.0040774Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0041017Z cpu family : 23 2025-05-07T20:23:13.0041231Z model : 49 2025-05-07T20:23:13.0041435Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0041685Z stepping : 0 2025-05-07T20:23:13.0041899Z microcode : 0x830107f 2025-05-07T20:23:13.0042120Z cpu MHz : 3318.665 2025-05-07T20:23:13.0042339Z cache size : 512 KB 2025-05-07T20:23:13.0042557Z physical id : 0 2025-05-07T20:23:13.0042758Z siblings : 16 2025-05-07T20:23:13.0042960Z core id : 5 2025-05-07T20:23:13.0043159Z cpu cores : 8 2025-05-07T20:23:13.0043352Z apicid : 10 2025-05-07T20:23:13.0043561Z initial apicid : 10 2025-05-07T20:23:13.0043774Z fpu : yes 2025-05-07T20:23:13.0043972Z fpu_exception : yes 2025-05-07T20:23:13.0044192Z cpuid level : 13 2025-05-07T20:23:13.0044401Z wp : yes 2025-05-07T20:23:13.0046336Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0048553Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0049047Z bogomips : 5600.00 2025-05-07T20:23:13.0049277Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0049517Z clflush size : 64 2025-05-07T20:23:13.0049729Z cache_alignment : 64 2025-05-07T20:23:13.0049999Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0050316Z power management: 2025-05-07T20:23:13.0050445Z 2025-05-07T20:23:13.0050528Z processor : 6 2025-05-07T20:23:13.0050738Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0050967Z cpu family : 23 2025-05-07T20:23:13.0051160Z model : 49 2025-05-07T20:23:13.0051365Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0051599Z stepping : 0 2025-05-07T20:23:13.0051798Z microcode : 0x830107f 2025-05-07T20:23:13.0052024Z cpu MHz : 3304.334 2025-05-07T20:23:13.0052231Z cache size : 512 KB 2025-05-07T20:23:13.0052437Z physical id : 0 2025-05-07T20:23:13.0052636Z siblings : 16 2025-05-07T20:23:13.0052830Z core id : 6 2025-05-07T20:23:13.0053019Z cpu cores : 8 2025-05-07T20:23:13.0053214Z apicid : 12 2025-05-07T20:23:13.0053415Z initial apicid : 12 2025-05-07T20:23:13.0053618Z fpu : yes 2025-05-07T20:23:13.0053813Z fpu_exception : yes 2025-05-07T20:23:13.0054027Z cpuid level : 13 2025-05-07T20:23:13.0054313Z wp : yes 2025-05-07T20:23:13.0056355Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0058587Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0059070Z bogomips : 5600.00 2025-05-07T20:23:13.0059292Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0059514Z clflush size : 64 2025-05-07T20:23:13.0059727Z cache_alignment : 64 2025-05-07T20:23:13.0059994Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0060305Z power management: 2025-05-07T20:23:13.0060444Z 2025-05-07T20:23:13.0060525Z processor : 7 2025-05-07T20:23:13.0060762Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0061014Z cpu family : 23 2025-05-07T20:23:13.0061219Z model : 49 2025-05-07T20:23:13.0061430Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0061666Z stepping : 0 2025-05-07T20:23:13.0061867Z microcode : 0x830107f 2025-05-07T20:23:13.0062085Z cpu MHz : 3297.610 2025-05-07T20:23:13.0062303Z cache size : 512 KB 2025-05-07T20:23:13.0062511Z physical id : 0 2025-05-07T20:23:13.0062718Z siblings : 16 2025-05-07T20:23:13.0062916Z core id : 7 2025-05-07T20:23:13.0063106Z cpu cores : 8 2025-05-07T20:23:13.0063306Z apicid : 
14 2025-05-07T20:23:13.0063508Z initial apicid : 14 2025-05-07T20:23:13.0063709Z fpu : yes 2025-05-07T20:23:13.0063901Z fpu_exception : yes 2025-05-07T20:23:13.0064112Z cpuid level : 13 2025-05-07T20:23:13.0064304Z wp : yes 2025-05-07T20:23:13.0066252Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0068482Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0068971Z bogomips : 5600.00 2025-05-07T20:23:13.0069180Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0069409Z clflush size : 64 2025-05-07T20:23:13.0069728Z cache_alignment : 64 2025-05-07T20:23:13.0069987Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0070296Z power management: 2025-05-07T20:23:13.0070434Z 2025-05-07T20:23:13.0070512Z processor : 8 2025-05-07T20:23:13.0070722Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0070950Z cpu family : 23 2025-05-07T20:23:13.0071180Z model : 49 2025-05-07T20:23:13.0071409Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0071642Z stepping : 0 2025-05-07T20:23:13.0071844Z microcode : 0x830107f 2025-05-07T20:23:13.0072065Z cpu MHz : 3302.130 2025-05-07T20:23:13.0072267Z cache size : 512 KB 2025-05-07T20:23:13.0072488Z physical id : 0 2025-05-07T20:23:13.0072706Z siblings : 16 2025-05-07T20:23:13.0072905Z core id : 0 2025-05-07T20:23:13.0073100Z cpu cores : 8 2025-05-07T20:23:13.0073302Z apicid : 1 2025-05-07T20:23:13.0073491Z initial apicid : 1 2025-05-07T20:23:13.0073704Z fpu : yes 2025-05-07T20:23:13.0073890Z fpu_exception : yes 2025-05-07T20:23:13.0074094Z cpuid level : 13 2025-05-07T20:23:13.0074292Z wp : yes 2025-05-07T20:23:13.0076229Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0078748Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0079231Z bogomips : 5600.00 2025-05-07T20:23:13.0079440Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0079672Z clflush size : 64 2025-05-07T20:23:13.0079886Z cache_alignment : 64 2025-05-07T20:23:13.0080140Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0080452Z power management: 2025-05-07T20:23:13.0080578Z 2025-05-07T20:23:13.0080659Z processor : 9 2025-05-07T20:23:13.0080863Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0081090Z cpu family : 23 2025-05-07T20:23:13.0081283Z model : 49 2025-05-07T20:23:13.0081471Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0081703Z 
stepping : 0 2025-05-07T20:23:13.0081901Z microcode : 0x830107f 2025-05-07T20:23:13.0082120Z cpu MHz : 3286.636 2025-05-07T20:23:13.0082316Z cache size : 512 KB 2025-05-07T20:23:13.0082517Z physical id : 0 2025-05-07T20:23:13.0082718Z siblings : 16 2025-05-07T20:23:13.0082904Z core id : 1 2025-05-07T20:23:13.0083101Z cpu cores : 8 2025-05-07T20:23:13.0083293Z apicid : 3 2025-05-07T20:23:13.0083475Z initial apicid : 3 2025-05-07T20:23:13.0083677Z fpu : yes 2025-05-07T20:23:13.0083870Z fpu_exception : yes 2025-05-07T20:23:13.0084078Z cpuid level : 13 2025-05-07T20:23:13.0084279Z wp : yes 2025-05-07T20:23:13.0086219Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0088458Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0088926Z bogomips : 5600.00 2025-05-07T20:23:13.0089139Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0089364Z clflush size : 64 2025-05-07T20:23:13.0089567Z cache_alignment : 64 2025-05-07T20:23:13.0089879Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0090286Z power management: 2025-05-07T20:23:13.0090413Z 2025-05-07T20:23:13.0090499Z processor : 10 2025-05-07T20:23:13.0090729Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0090992Z cpu family : 23 2025-05-07T20:23:13.0091184Z model : 49 2025-05-07T20:23:13.0091379Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0091614Z stepping : 0 2025-05-07T20:23:13.0091811Z microcode : 0x830107f 2025-05-07T20:23:13.0092024Z cpu MHz : 3275.585 2025-05-07T20:23:13.0092230Z cache size : 512 KB 2025-05-07T20:23:13.0092429Z physical id : 0 2025-05-07T20:23:13.0092625Z siblings : 16 2025-05-07T20:23:13.0092814Z core id : 2 2025-05-07T20:23:13.0092999Z cpu cores : 8 2025-05-07T20:23:13.0093184Z apicid : 5 2025-05-07T20:23:13.0093379Z initial apicid : 5 2025-05-07T20:23:13.0093584Z fpu : yes 2025-05-07T20:23:13.0093763Z fpu_exception : yes 2025-05-07T20:23:13.0093966Z cpuid level : 13 2025-05-07T20:23:13.0094162Z wp : yes 2025-05-07T20:23:13.0096142Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0098451Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0099107Z bogomips : 5600.00 2025-05-07T20:23:13.0099513Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0099831Z clflush size : 64 2025-05-07T20:23:13.0100051Z cache_alignment : 64 2025-05-07T20:23:13.0100317Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:13.0100653Z power management: 2025-05-07T20:23:13.0100859Z 2025-05-07T20:23:13.0100996Z processor : 11 2025-05-07T20:23:13.0101290Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0101624Z cpu family : 23 2025-05-07T20:23:13.0101836Z model : 49 2025-05-07T20:23:13.0102053Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0102287Z stepping : 0 2025-05-07T20:23:13.0102483Z microcode : 0x830107f 2025-05-07T20:23:13.0102700Z cpu MHz : 3269.732 2025-05-07T20:23:13.0102908Z cache size : 512 KB 2025-05-07T20:23:13.0103177Z physical id : 0 2025-05-07T20:23:13.0103456Z siblings : 16 2025-05-07T20:23:13.0103935Z core id : 3 2025-05-07T20:23:13.0104193Z cpu cores : 8 2025-05-07T20:23:13.0104449Z apicid : 7 2025-05-07T20:23:13.0104713Z initial apicid : 7 2025-05-07T20:23:13.0104996Z fpu : yes 2025-05-07T20:23:13.0105252Z fpu_exception : yes 2025-05-07T20:23:13.0105539Z cpuid level : 13 2025-05-07T20:23:13.0105754Z wp : yes 2025-05-07T20:23:13.0107695Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0109999Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0110477Z bogomips : 5600.00 2025-05-07T20:23:13.0110693Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0110921Z clflush size : 64 2025-05-07T20:23:13.0111133Z cache_alignment : 64 2025-05-07T20:23:13.0111400Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0111705Z power management: 2025-05-07T20:23:13.0111841Z 2025-05-07T20:23:13.0111922Z processor : 12 2025-05-07T20:23:13.0112131Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0112358Z cpu family : 23 2025-05-07T20:23:13.0112554Z model : 49 2025-05-07T20:23:13.0112752Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0112991Z stepping : 0 2025-05-07T20:23:13.0113184Z microcode : 0x830107f 2025-05-07T20:23:13.0113401Z cpu MHz : 3263.783 2025-05-07T20:23:13.0113612Z cache size : 512 KB 2025-05-07T20:23:13.0113815Z physical id : 0 2025-05-07T20:23:13.0114042Z siblings : 16 2025-05-07T20:23:13.0114314Z core id : 4 2025-05-07T20:23:13.0114575Z cpu cores : 8 2025-05-07T20:23:13.0114840Z apicid : 9 2025-05-07T20:23:13.0115102Z initial apicid : 9 2025-05-07T20:23:13.0115389Z fpu : yes 2025-05-07T20:23:13.0115665Z fpu_exception : yes 2025-05-07T20:23:13.0115964Z cpuid level : 13 2025-05-07T20:23:13.0116237Z wp : yes 2025-05-07T20:23:13.0118712Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:13.0121133Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0121609Z bogomips : 5600.00 2025-05-07T20:23:13.0121816Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0122039Z clflush size : 64 2025-05-07T20:23:13.0122246Z cache_alignment : 64 2025-05-07T20:23:13.0122634Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0122933Z power management: 2025-05-07T20:23:13.0123064Z 2025-05-07T20:23:13.0123143Z processor : 13 2025-05-07T20:23:13.0123347Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0123565Z cpu family : 23 2025-05-07T20:23:13.0123754Z model : 49 2025-05-07T20:23:13.0123950Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0124173Z stepping : 0 2025-05-07T20:23:13.0124366Z microcode : 0x830107f 2025-05-07T20:23:13.0124586Z cpu MHz : 3300.457 2025-05-07T20:23:13.0124785Z cache size : 512 KB 2025-05-07T20:23:13.0124992Z physical id : 0 2025-05-07T20:23:13.0125191Z siblings : 16 2025-05-07T20:23:13.0125374Z core id : 5 2025-05-07T20:23:13.0125564Z cpu cores : 8 2025-05-07T20:23:13.0125752Z apicid : 11 2025-05-07T20:23:13.0126010Z initial apicid : 11 2025-05-07T20:23:13.0126298Z fpu : yes 2025-05-07T20:23:13.0126554Z fpu_exception : yes 2025-05-07T20:23:13.0126833Z cpuid level : 13 2025-05-07T20:23:13.0127116Z wp : yes 2025-05-07T20:23:13.0129820Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0132040Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0132509Z bogomips : 5600.00 2025-05-07T20:23:13.0132713Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0132936Z clflush size : 64 2025-05-07T20:23:13.0133140Z cache_alignment : 64 2025-05-07T20:23:13.0133394Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0133712Z power management: 2025-05-07T20:23:13.0133843Z 2025-05-07T20:23:13.0133926Z processor : 14 2025-05-07T20:23:13.0134125Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0134347Z cpu family : 23 2025-05-07T20:23:13.0134539Z model : 49 2025-05-07T20:23:13.0134725Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0134952Z stepping : 0 2025-05-07T20:23:13.0135152Z microcode : 0x830107f 2025-05-07T20:23:13.0135358Z cpu MHz : 3285.956 2025-05-07T20:23:13.0135561Z cache size : 512 KB 2025-05-07T20:23:13.0135768Z physical id : 0 2025-05-07T20:23:13.0135956Z siblings : 16 2025-05-07T20:23:13.0136144Z core id : 6 2025-05-07T20:23:13.0136333Z cpu cores : 8 2025-05-07T20:23:13.0136518Z apicid : 13 2025-05-07T20:23:13.0136710Z initial apicid : 13 2025-05-07T20:23:13.0136910Z fpu : yes 2025-05-07T20:23:13.0137093Z fpu_exception : yes 2025-05-07T20:23:13.0137296Z cpuid level : 13 2025-05-07T20:23:13.0137491Z wp : yes 2025-05-07T20:23:13.0139802Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0143192Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0143660Z bogomips : 5600.00 2025-05-07T20:23:13.0143871Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0144092Z clflush size : 64 2025-05-07T20:23:13.0144294Z cache_alignment : 64 2025-05-07T20:23:13.0144552Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0144858Z power management: 2025-05-07T20:23:13.0144984Z 2025-05-07T20:23:13.0145158Z processor : 15 2025-05-07T20:23:13.0145367Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0145593Z cpu family : 23 2025-05-07T20:23:13.0145784Z model : 49 2025-05-07T20:23:13.0145981Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0146208Z stepping : 0 2025-05-07T20:23:13.0146406Z microcode : 0x830107f 2025-05-07T20:23:13.0146617Z cpu MHz : 3291.566 2025-05-07T20:23:13.0146827Z cache size : 512 KB 2025-05-07T20:23:13.0147028Z physical id : 0 2025-05-07T20:23:13.0147236Z siblings : 16 2025-05-07T20:23:13.0147431Z core id : 7 2025-05-07T20:23:13.0147614Z cpu cores : 8 2025-05-07T20:23:13.0147805Z apicid : 15 2025-05-07T20:23:13.0147997Z initial apicid : 15 2025-05-07T20:23:13.0148196Z fpu : yes 2025-05-07T20:23:13.0148383Z fpu_exception : yes 2025-05-07T20:23:13.0148594Z cpuid level : 13 2025-05-07T20:23:13.0148787Z wp : yes 2025-05-07T20:23:13.0150832Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0153665Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0154337Z bogomips : 5600.00 2025-05-07T20:23:13.0154626Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0154856Z clflush size : 64 2025-05-07T20:23:13.0155061Z cache_alignment : 64 2025-05-07T20:23:13.0155314Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0155620Z power management: 2025-05-07T20:23:13.0155751Z 2025-05-07T20:23:13.0155756Z 2025-05-07T20:23:13.0155871Z ################################################################################ 2025-05-07T20:23:13.0156209Z [INFO] Print PCI info ... 2025-05-07T20:23:13.0156458Z + lspci -v 2025-05-07T20:23:13.0156583Z 2025-05-07T20:23:13.0156790Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:13.0157162Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:13.0157469Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:13.0157672Z 2025-05-07T20:23:13.0157871Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:13.0158236Z Physical Slot: 1 2025-05-07T20:23:13.0158469Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:13.0158665Z 2025-05-07T20:23:13.0158912Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:13.0159327Z Physical Slot: 1 2025-05-07T20:23:13.0159575Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:13.0159791Z 2025-05-07T20:23:13.0160054Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:13.0160484Z Physical Slot: 3 2025-05-07T20:23:13.0160714Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:13.0161042Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:13.0161388Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:13.0161605Z 2025-05-07T20:23:13.0161897Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:13.0162503Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:13.0162778Z Physical Slot: 4 2025-05-07T20:23:13.0163027Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:13.0163398Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:13.0163742Z Capabilities: 2025-05-07T20:23:13.0164003Z Kernel driver in use: nvme 2025-05-07T20:23:13.0164160Z 2025-05-07T20:23:13.0164450Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:13.0164924Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:13.0165259Z Physical Slot: 5 2025-05-07T20:23:13.0165493Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:13.0165845Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:13.0166217Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:13.0166526Z Capabilities: 2025-05-07T20:23:13.0166789Z Kernel driver in use: ena 2025-05-07T20:23:13.0167029Z Kernel modules: ena 2025-05-07T20:23:13.0167165Z 2025-05-07T20:23:13.0167334Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:13.0167696Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:13.0167982Z Physical Slot: 30 2025-05-07T20:23:13.0168232Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:13.0168594Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:13.0168975Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:13.0169466Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:13.0169907Z Capabilities: 2025-05-07T20:23:13.0170264Z Kernel driver in use: nvidia 2025-05-07T20:23:13.0170585Z Kernel modules: nvidia 2025-05-07T20:23:13.0170730Z 2025-05-07T20:23:13.0171028Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:13.0171526Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:13.0171805Z Physical Slot: 31 2025-05-07T20:23:13.0172043Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:13.0172380Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:13.0172754Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:13.0173068Z Capabilities: 2025-05-07T20:23:13.0173319Z Kernel driver in use: nvme 2025-05-07T20:23:13.0173480Z 2025-05-07T20:23:13.0173485Z 2025-05-07T20:23:13.0173599Z ################################################################################ 2025-05-07T20:23:13.0173909Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:13.0174187Z + uname -a 2025-05-07T20:23:13.0174294Z 2025-05-07T20:23:13.0174689Z Linux ip-10-0-73-154.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:13.0175178Z 2025-05-07T20:23:13.0175254Z + uname -m 2025-05-07T20:23:13.0175364Z 2025-05-07T20:23:13.0175439Z x86_64 2025-05-07T20:23:13.0175542Z 2025-05-07T20:23:13.0175627Z + cat /proc/version 2025-05-07T20:23:13.0175765Z 2025-05-07T20:23:13.0176291Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:13.0176908Z 2025-05-07T20:23:13.0176993Z + cat /etc/os-release 2025-05-07T20:23:13.0177131Z 2025-05-07T20:23:13.0177224Z NAME="Amazon Linux" 2025-05-07T20:23:13.0177428Z VERSION="2023" 2025-05-07T20:23:13.0177630Z ID="amzn" 2025-05-07T20:23:13.0177818Z ID_LIKE="fedora" 2025-05-07T20:23:13.0178015Z VERSION_ID="2023" 2025-05-07T20:23:13.0178242Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:13.0178519Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:13.0186147Z ANSI_COLOR="0;33" 2025-05-07T20:23:13.0186434Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:13.0186947Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:13.0187378Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:13.0187792Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:13.0188225Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:13.0188592Z VENDOR_NAME="AWS" 2025-05-07T20:23:13.0188834Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:13.0189116Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:13.0189274Z 2025-05-07T20:23:13.0189502Z ################################################################################ 2025-05-07T20:23:13.0189908Z # Print EC2 Instance Info 2025-05-07T20:23:13.0190141Z # 2025-05-07T20:23:13.0190354Z # [2025-05-07T20:23:13.016Z] + print_ec2_info 2025-05-07T20:23:13.0190667Z ################################################################################ 2025-05-07T20:23:13.0190875Z 2025-05-07T20:23:13.0289197Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:13.0412854Z instance-id: i-00cb9561c833cfdb2 2025-05-07T20:23:13.0527099Z instance-type: g5.4xlarge 2025-05-07T20:23:13.0572018Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:13.0572373Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:13.0581803Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:13.0582154Z env: 2025-05-07T20:23:13.0582373Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:13.0582677Z BUILD_ENV: build_binary 2025-05-07T20:23:13.0582926Z BUILD_TARGET: genai 2025-05-07T20:23:13.0583149Z BUILD_VARIANT: cuda 2025-05-07T20:23:13.0583391Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:13.0583644Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:13.0583945Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:13.0584275Z ##[endgroup] 2025-05-07T20:23:13.3930697Z ################################################################################ 2025-05-07T20:23:13.3931248Z [INFO] Printing general display info ... 2025-05-07T20:23:13.3961500Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:13.5084727Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:13.5095730Z /usr/bin/sudo 2025-05-07T20:23:13.5106617Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:13.5116843Z /usr/bin/yum 2025-05-07T20:23:13.5118627Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:13.5139399Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.9555190Z Last metadata expiration check: 0:00:08 ago on Wed May 7 20:23:05 2025. 2025-05-07T20:23:14.0307718Z ================================================================================ 2025-05-07T20:23:14.0308078Z WARNING: 2025-05-07T20:23:14.0308317Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:14.0308548Z 2025-05-07T20:23:14.0308645Z Available Versions: 2025-05-07T20:23:14.0308787Z 2025-05-07T20:23:14.0308873Z Version 2023.7.20250331: 2025-05-07T20:23:14.0309180Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:14.0309467Z 2025-05-07T20:23:14.0309658Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:14.0309866Z 2025-05-07T20:23:14.0309953Z Release notes: 2025-05-07T20:23:14.0310346Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:14.0310718Z 2025-05-07T20:23:14.0310803Z Version 2023.7.20250414: 2025-05-07T20:23:14.0311105Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:14.0311380Z 2025-05-07T20:23:14.0311506Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:14.0311714Z 2025-05-07T20:23:14.0311795Z Release notes: 2025-05-07T20:23:14.0312179Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:14.0312539Z 2025-05-07T20:23:14.0312630Z Version 2023.7.20250428: 2025-05-07T20:23:14.0312923Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:14.0313168Z 2025-05-07T20:23:14.0313520Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:14.0313728Z 2025-05-07T20:23:14.0313818Z Release notes: 2025-05-07T20:23:14.0314204Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:14.0314561Z 2025-05-07T20:23:14.0314678Z ================================================================================ 2025-05-07T20:23:14.1476090Z Dependencies resolved. 
2025-05-07T20:23:14.1762502Z ================================================================================ 2025-05-07T20:23:14.1762905Z Package Arch Version Repository Size 2025-05-07T20:23:14.1763266Z ================================================================================ 2025-05-07T20:23:14.1763559Z Upgrading: 2025-05-07T20:23:14.1763913Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:14.1764481Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:14.1764842Z 2025-05-07T20:23:14.1765134Z Transaction Summary 2025-05-07T20:23:14.1765382Z ================================================================================ 2025-05-07T20:23:14.1765680Z Upgrade 2 Packages 2025-05-07T20:23:14.1765814Z 2025-05-07T20:23:14.1765933Z Total download size: 6.9 M 2025-05-07T20:23:14.1767602Z Downloading Packages: 2025-05-07T20:23:14.2318707Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 23 MB/s | 1.2 MB 00:00 2025-05-07T20:23:14.2668403Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 64 MB/s | 5.7 MB 00:00 2025-05-07T20:23:14.2676025Z -------------------------------------------------------------------------------- 2025-05-07T20:23:14.2679271Z Total 76 MB/s | 6.9 MB 00:00 2025-05-07T20:23:14.2681639Z Running transaction check 2025-05-07T20:23:14.2779918Z Transaction check succeeded. 2025-05-07T20:23:14.2780516Z Running transaction test 2025-05-07T20:23:14.3077415Z Transaction test succeeded. 2025-05-07T20:23:14.3080007Z Running transaction 2025-05-07T20:23:14.8608921Z Preparing : 1/1 2025-05-07T20:23:14.9666551Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.9693638Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.9895919Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.9897295Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:15.0008014Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:15.0035975Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:15.1497769Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:15.1498930Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:15.1500031Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:15.1501057Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:15.2903244Z ================================================================================ 2025-05-07T20:23:15.2903634Z WARNING: 2025-05-07T20:23:15.2904088Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:15.2904316Z 2025-05-07T20:23:15.2904423Z Available Versions: 2025-05-07T20:23:15.2904569Z 2025-05-07T20:23:15.2904663Z Version 2023.7.20250331: 2025-05-07T20:23:15.2904968Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:15.2905223Z 2025-05-07T20:23:15.2905345Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:15.2905551Z 2025-05-07T20:23:15.2905646Z Release notes: 2025-05-07T20:23:15.2906059Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:15.2906689Z 2025-05-07T20:23:15.2906794Z Version 2023.7.20250414: 2025-05-07T20:23:15.2907101Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:15.2907348Z 2025-05-07T20:23:15.2907471Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:15.2907677Z 2025-05-07T20:23:15.2907759Z Release notes: 2025-05-07T20:23:15.2908158Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:15.2908523Z 2025-05-07T20:23:15.2908619Z Version 2023.7.20250428: 2025-05-07T20:23:15.2908920Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:15.2909174Z 2025-05-07T20:23:15.2909285Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:15.2909494Z 2025-05-07T20:23:15.2909644Z Release notes: 2025-05-07T20:23:15.2910030Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:15.2910391Z 2025-05-07T20:23:15.2910703Z ================================================================================ 2025-05-07T20:23:15.3475346Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:15.3476285Z 2025-05-07T20:23:15.3476505Z Upgraded: 2025-05-07T20:23:15.3477471Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:15.3479185Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:15.3480193Z 2025-05-07T20:23:15.3480413Z Complete! 2025-05-07T20:23:15.3939774Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:15.3961505Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:15.8489949Z Last metadata expiration check: 0:00:10 ago on Wed May 7 20:23:05 2025. 2025-05-07T20:23:15.8730526Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:15.9135351Z Dependencies resolved. 
2025-05-07T20:23:15.9312617Z ================================================================================ 2025-05-07T20:23:15.9313067Z Package Architecture Version Repository Size 2025-05-07T20:23:15.9313477Z ================================================================================ 2025-05-07T20:23:15.9313773Z Installing: 2025-05-07T20:23:15.9314058Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:15.9314322Z 2025-05-07T20:23:15.9314414Z Transaction Summary 2025-05-07T20:23:15.9314649Z ================================================================================ 2025-05-07T20:23:15.9314943Z Install 1 Package 2025-05-07T20:23:15.9315078Z 2025-05-07T20:23:15.9315394Z Total download size: 319 k 2025-05-07T20:23:15.9315747Z Installed size: 837 k 2025-05-07T20:23:15.9317540Z Downloading Packages: 2025-05-07T20:23:16.0101664Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.5 MB/s | 319 kB 00:00 2025-05-07T20:23:16.0107608Z -------------------------------------------------------------------------------- 2025-05-07T20:23:16.0110539Z Total 3.9 MB/s | 319 kB 00:00 2025-05-07T20:23:16.0264874Z Running transaction check 2025-05-07T20:23:16.0319291Z Transaction check succeeded. 2025-05-07T20:23:16.0319895Z Running transaction test 2025-05-07T20:23:16.0773978Z Transaction test succeeded. 2025-05-07T20:23:16.0777673Z Running transaction 2025-05-07T20:23:16.1819599Z Preparing : 1/1 2025-05-07T20:23:16.2355667Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:16.4124448Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:16.5465345Z ================================================================================ 2025-05-07T20:23:16.5465747Z WARNING: 2025-05-07T20:23:16.5465994Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:16.5466584Z 2025-05-07T20:23:16.5466676Z Available Versions: 2025-05-07T20:23:16.5466851Z 2025-05-07T20:23:16.5466940Z Version 2023.7.20250331: 2025-05-07T20:23:16.5467252Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:16.5467503Z 2025-05-07T20:23:16.5467630Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:16.5467838Z 2025-05-07T20:23:16.5467922Z Release notes: 2025-05-07T20:23:16.5468328Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:16.5468698Z 2025-05-07T20:23:16.5468791Z Version 2023.7.20250414: 2025-05-07T20:23:16.5469088Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:16.5469339Z 2025-05-07T20:23:16.5469450Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:16.5469748Z 2025-05-07T20:23:16.5469831Z Release notes: 2025-05-07T20:23:16.5470221Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:16.5470587Z 2025-05-07T20:23:16.5470841Z Version 2023.7.20250428: 2025-05-07T20:23:16.5471143Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:16.5471387Z 2025-05-07T20:23:16.5471503Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:16.5471705Z 2025-05-07T20:23:16.5471794Z Release notes: 2025-05-07T20:23:16.5472172Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:16.5472535Z 2025-05-07T20:23:16.5472657Z ================================================================================ 2025-05-07T20:23:16.5810054Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:16.5810387Z 2025-05-07T20:23:16.5810472Z Installed: 2025-05-07T20:23:16.5810779Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:16.5811063Z 2025-05-07T20:23:16.5811156Z Complete! 2025-05-07T20:23:16.6285611Z + hostname 2025-05-07T20:23:16.6285797Z 2025-05-07T20:23:16.6299074Z ip-10-0-73-154.ec2.internal 2025-05-07T20:23:16.6300230Z 2025-05-07T20:23:16.6301019Z + sudo lshw -C display 2025-05-07T20:23:16.6301234Z 2025-05-07T20:23:17.0502701Z *-display:0 UNCLAIMED 2025-05-07T20:23:17.0503037Z description: VGA compatible controller 2025-05-07T20:23:17.0503360Z product: Amazon.com, Inc. 2025-05-07T20:23:17.0503636Z vendor: Amazon.com, Inc. 
2025-05-07T20:23:17.0504125Z physical id: 3 2025-05-07T20:23:17.0504354Z bus info: pci@0000:00:03.0 2025-05-07T20:23:17.0504606Z version: 00 2025-05-07T20:23:17.0504815Z width: 32 bits 2025-05-07T20:23:17.0505036Z clock: 33MHz 2025-05-07T20:23:17.0505285Z capabilities: vga_controller bus_master 2025-05-07T20:23:17.0505593Z configuration: latency=0 2025-05-07T20:23:17.0505913Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:17.0506235Z *-display:1 2025-05-07T20:23:17.0506450Z description: 3D controller 2025-05-07T20:23:17.0506766Z product: GA102GL [A10G] 2025-05-07T20:23:17.0507023Z vendor: NVIDIA Corporation 2025-05-07T20:23:17.0507283Z physical id: 1e 2025-05-07T20:23:17.0507515Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:17.0507757Z version: a1 2025-05-07T20:23:17.0507970Z width: 64 bits 2025-05-07T20:23:17.0508187Z clock: 33MHz 2025-05-07T20:23:17.0508468Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:17.0508838Z configuration: driver=nvidia latency=0 2025-05-07T20:23:17.0509457Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:17.0542565Z 2025-05-07T20:23:17.0542975Z ################################################################################ 2025-05-07T20:23:17.0543487Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:17.0675141Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:17.0843593Z Wed May 7 20:23:17 2025 2025-05-07T20:23:17.0843982Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:17.0844474Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:17.0844953Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:17.0845441Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:17.0845957Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:17.0846375Z | | | MIG M. | 2025-05-07T20:23:17.0846706Z |=========================================+========================+======================| 2025-05-07T20:23:17.0922261Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:17.0922912Z | 0% 30C P0 58W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:17.0923280Z | | | N/A | 2025-05-07T20:23:17.0923670Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:17.0924065Z 2025-05-07T20:23:17.0924445Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:17.0924857Z | Processes: | 2025-05-07T20:23:17.0925283Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:17.0925686Z | ID ID Usage | 2025-05-07T20:23:17.0926021Z |=========================================================================================| 2025-05-07T20:23:17.0926822Z | No running processes found | 2025-05-07T20:23:17.0927282Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:17.2355609Z ################################################################################ 2025-05-07T20:23:17.2355946Z [INFO] Printing AMD GPU info ... 
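Taken together, the NVIDIA query above and the ROCm probes below indicate that print_gpu_info checks both GPU vendors, with ENFORCE_CUDA_DEVICE=1 (set in this job's env) presumably making a missing CUDA device fatal. A rough sketch of that logic, with all structure assumed:

# Hypothetical vendor-detection logic modeled on this log's output;
# the real print_gpu_info in setup_env.bash may differ.
print_gpu_info_sketch () {
  if which nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
  elif [ "${ENFORCE_CUDA_DEVICE:-0}" = "1" ]; then
    echo "[CHECK] nvidia-smi not found, but ENFORCE_CUDA_DEVICE=1" >&2
    return 1
  fi
  for tool in rocminfo rocm-smi; do
    if which "$tool" >/dev/null 2>&1; then
      "$tool"
    else
      echo "[CHECK] $tool not found"
    fi
  done
}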
2025-05-07T20:23:17.2355609Z ################################################################################
2025-05-07T20:23:17.2355946Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:17.2497831Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:17.2498610Z [CHECK] rocminfo not found
2025-05-07T20:23:17.2507602Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:17.2508668Z [CHECK] rocm-smi not found
2025-05-07T20:23:17.2558429Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:17.2558856Z . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:17.2571936Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:17.2572291Z env:
2025-05-07T20:23:17.2572504Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:17.2572801Z   BUILD_ENV: build_binary
2025-05-07T20:23:17.2573043Z   BUILD_TARGET: genai
2025-05-07T20:23:17.2573261Z   BUILD_VARIANT: cuda
2025-05-07T20:23:17.2573491Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:17.2573740Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:17.2574031Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:17.2574357Z ##[endgroup]
2025-05-07T20:23:17.5919225Z ################################################################################
2025-05-07T20:23:17.5919569Z # Setup Miniconda
2025-05-07T20:23:17.5919781Z #
2025-05-07T20:23:17.5935761Z # [2025-05-07T20:23:17.593Z] + setup_miniconda /home/ec2-user/miniconda
2025-05-07T20:23:17.5936221Z ################################################################################
2025-05-07T20:23:17.5936467Z 
2025-05-07T20:23:17.5952353Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:17.6995409Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:17.6995765Z + mkdir -p /home/ec2-user/miniconda
2025-05-07T20:23:17.6995963Z 
2025-05-07T20:23:17.7013501Z [SETUP] Downloading the Miniconda installer ...
2025-05-07T20:23:17.7036189Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
2025-05-07T20:23:18.5697896Z [SETUP] Installing Miniconda ...
2025-05-07T20:23:18.5698619Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u
2025-05-07T20:23:18.5699109Z 
2025-05-07T20:23:18.5841855Z PREFIX=/home/ec2-user/miniconda
2025-05-07T20:23:19.0342314Z Unpacking payload ...
2025-05-07T20:23:19.5535269Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:20.3524196Z entry_point.py:256: DeprecationWarning: (same warning repeated)
2025-05-07T20:23:22.4444672Z 
2025-05-07T20:23:22.4445015Z Installing base environment...
2025-05-07T20:23:22.4445233Z 
2025-05-07T20:23:23.5179612Z Preparing transaction: ...working... done
2025-05-07T20:23:26.5099633Z Executing transaction: ...working... done
2025-05-07T20:23:27.1765996Z entry_point.py:256: DeprecationWarning: (same warning repeated)
2025-05-07T20:23:27.2659127Z installation finished.
2025-05-07T20:23:27.2667877Z 
2025-05-07T20:23:27.2668283Z + rm -f miniconda.sh
2025-05-07T20:23:27.2668525Z 
2025-05-07T20:23:27.2979744Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:23:27.2980243Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:27.2980559Z 
2025-05-07T20:23:27.6628546Z no change     /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:27.6628963Z no change     /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:27.6629431Z no change     /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:27.6629870Z no change     /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:27.6630221Z no change     /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:27.6630614Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:27.6631045Z no change     /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:27.6631474Z no change     /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:27.6631923Z no change     /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:27.6632701Z no change     /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:27.6633220Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:27.6633582Z modified      /home/ec2-user/.bashrc
2025-05-07T20:23:27.6633777Z 
2025-05-07T20:23:27.6633968Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.6634258Z 
2025-05-07T20:23:27.7276889Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:27.7277097Z 
2025-05-07T20:23:28.5626790Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:28.5650724Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:42.0299860Z Collecting package metadata (current_repodata.json): ...working... done
2025-05-07T20:23:43.5908410Z Solving environment: ...working... done
2025-05-07T20:23:43.6882468Z 
2025-05-07T20:23:43.6882929Z ## Package Plan ##
2025-05-07T20:23:43.6883110Z 
2025-05-07T20:23:43.6883248Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.6883499Z 
2025-05-07T20:23:43.6883593Z   added / updated specs:
2025-05-07T20:23:43.6883860Z     - conda-libmamba-solver
2025-05-07T20:23:43.6884101Z     - libarchive
2025-05-07T20:23:43.6884304Z     - libmamba
2025-05-07T20:23:43.6884507Z     - libmambapy
2025-05-07T20:23:43.6884630Z 
2025-05-07T20:23:43.6884775Z The following packages will be downloaded:
2025-05-07T20:23:43.6884998Z 
2025-05-07T20:23:43.6885107Z     package                      |            build
2025-05-07T20:23:43.6885425Z     -----------------------------|-----------------
2025-05-07T20:23:43.6885831Z     ca-certificates-2025.4.26    |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:23:43.6886300Z     certifi-2025.4.26            |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:23:43.6886730Z     conda-25.3.1                 |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:23:43.6887196Z     conda-libmamba-solver-25.4.0 |     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:23:43.6887630Z     ------------------------------------------------------------
2025-05-07T20:23:43.6887965Z                                            Total:         1.4 MB
2025-05-07T20:23:43.6888175Z 
2025-05-07T20:23:43.6888282Z The following packages will be UPDATED:
2025-05-07T20:23:43.6888483Z 
2025-05-07T20:23:43.6893308Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.6894084Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.6894472Z 
2025-05-07T20:23:43.6894693Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.6895013Z 
2025-05-07T20:23:43.6895326Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.6896119Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.6896602Z 
2025-05-07T20:23:43.6896754Z Downloading and Extracting Packages: ...working...
[progress-bar redraw output elided: ca-certificates-2025.4.26, certifi-2025.4.26, conda-25.3.1, and conda-libmamba-solver-25.4.0 each downloaded to 100%]
2025-05-07T20:23:43.9173494Z done
2025-05-07T20:23:44.0178057Z Preparing transaction: ...working... done
2025-05-07T20:23:44.1184047Z Verifying transaction: ...working... done
2025-05-07T20:23:45.4203010Z Executing transaction: ...working... done
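[EXAMPLE] The "[EXEC] [ATTEMPT n/3]" lines throughout this job come from a retry helper defined in .github/scripts/setup_env.bash; the helper's source is not shown in this log, so the following is only a hypothetical bash sketch of the pattern those lines suggest:

    # Hypothetical sketch of a bounded-retry runner (the real helper lives in
    # setup_env.bash and may differ in name and behavior).
    exec_with_retries () {
      local max_attempts=3 attempt
      for attempt in $(seq 0 $((max_attempts - 1))); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
        "$@" && return 0   # stop at the first success
        sleep 2            # brief pause before retrying
      done
      echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
      return 1
    }

    # Usage, mirroring the network check seen in this log:
    # exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null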
2025-05-07T20:23:47.1263042Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:23:47.1287718Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:48.0658478Z Channels:
2025-05-07T20:23:48.0658867Z  - defaults
2025-05-07T20:23:48.0659230Z Platform: linux-64
2025-05-07T20:23:49.2879242Z Collecting package metadata (repodata.json): ...working... done
2025-05-07T20:23:49.4049347Z Solving environment: ...working...
2025-05-07T20:23:49.4049347Z Channels:
2025-05-07T20:23:49.4049657Z  - defaults
2025-05-07T20:23:49.4050018Z Platform: linux-64
2025-05-07T20:23:49.6971772Z Collecting package metadata (repodata.json): ...working... done
2025-05-07T20:23:49.9127057Z Solving environment: ...working... done
2025-05-07T20:23:49.9954399Z done
2025-05-07T20:23:50.0619482Z 
2025-05-07T20:23:50.0619760Z ## Package Plan ##
2025-05-07T20:23:50.0620003Z 
2025-05-07T20:23:50.0620151Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:50.0620409Z 
2025-05-07T20:23:50.0620501Z   added / updated specs:
2025-05-07T20:23:50.0620746Z     - conda
2025-05-07T20:23:50.0620858Z 
2025-05-07T20:23:50.0620983Z The following packages will be downloaded:
2025-05-07T20:23:50.0621192Z 
2025-05-07T20:23:50.0621303Z     package                    |            build
2025-05-07T20:23:50.0621612Z     ---------------------------|-----------------
2025-05-07T20:23:50.0621946Z     pip-25.1                   |     pyhc872135_2         1.3 MB
2025-05-07T20:23:50.0622565Z     tzdata-2025b               |       h04d1e81_0         116 KB
2025-05-07T20:23:50.0623000Z     ------------------------------------------------------------
2025-05-07T20:23:50.0623366Z                                            Total:         1.4 MB
2025-05-07T20:23:50.0623576Z 
2025-05-07T20:23:50.0623688Z The following packages will be UPDATED:
2025-05-07T20:23:50.0623892Z 
2025-05-07T20:23:50.0624178Z   pip                pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:50.0624672Z   tzdata                                 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:50.0624915Z 
2025-05-07T20:23:50.0625065Z Downloading and Extracting Packages: ...working...
[progress-bar redraw output elided: pip-25.1 (1.3 MB) and tzdata-2025b (116 KB) downloaded to 100%]
2025-05-07T20:23:50.3358417Z done
2025-05-07T20:23:50.4361114Z Preparing transaction: ...working... done
2025-05-07T20:23:50.5367065Z Verifying transaction: ...working... done
2025-05-07T20:23:52.5441943Z Executing transaction: ...working... done
2025-05-07T20:23:53.1513493Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:53.1517985Z + conda clean --packages --tarball -y
2025-05-07T20:23:53.1518195Z 
2025-05-07T20:23:54.1502565Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:54.1503019Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.2140661Z 
2025-05-07T20:23:54.2148794Z + conda clean --all -y
2025-05-07T20:23:54.2149000Z 
2025-05-07T20:23:54.8089602Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.8089928Z Will remove 1 index cache(s).
2025-05-07T20:23:54.8090209Z There are no unused package(s) to remove.
2025-05-07T20:23:54.8090505Z There are no tempfile(s) to remove.
2025-05-07T20:23:54.8090794Z There are no logfile(s) to remove.
2025-05-07T20:23:54.8773910Z 
2025-05-07T20:23:54.8778573Z + conda info
2025-05-07T20:23:54.8778734Z 
2025-05-07T20:23:55.6667852Z 
2025-05-07T20:23:55.6668398Z      active environment : base
2025-05-07T20:23:55.6668762Z     active env location : /home/ec2-user/miniconda
2025-05-07T20:23:55.6669092Z             shell level : 1
2025-05-07T20:23:55.6669370Z        user config file : /home/ec2-user/.condarc
2025-05-07T20:23:55.6669858Z  populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:23:55.6670212Z           conda version : 25.3.1
2025-05-07T20:23:55.6670501Z     conda-build version : not installed
2025-05-07T20:23:55.6670805Z          python version : 3.13.2.final.0
2025-05-07T20:23:55.6671101Z                  solver : libmamba (default)
2025-05-07T20:23:55.6671448Z        virtual packages : __archspec=1=zen2
2025-05-07T20:23:55.6671760Z                           __conda=25.3.1=0
2025-05-07T20:23:55.6672032Z                           __cuda=12.8=0
2025-05-07T20:23:55.6672307Z                           __glibc=2.34=0
2025-05-07T20:23:55.6672600Z                           __linux=6.1.130=0
2025-05-07T20:23:55.6672878Z                           __unix=0=0
2025-05-07T20:23:55.6673548Z        base environment : /home/ec2-user/miniconda  (writable)
2025-05-07T20:23:55.6673961Z       conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:23:55.6674315Z   conda av metadata url : None
2025-05-07T20:23:55.6674674Z            channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:23:55.6675106Z                           https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:23:55.6675492Z                           https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:23:55.6675872Z                           https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:23:55.6676233Z           package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:23:55.6676576Z                           /home/ec2-user/.conda/pkgs
2025-05-07T20:23:55.6676916Z        envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:23:55.6677246Z                           /home/ec2-user/.conda/envs
2025-05-07T20:23:55.6677550Z                platform : linux-64
2025-05-07T20:23:55.6678384Z              user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:23:55.6679356Z                 UID:GID : 1000:1000
2025-05-07T20:23:55.6679626Z              netrc file : None
2025-05-07T20:23:55.6679888Z            offline mode : False
2025-05-07T20:23:55.6680055Z 
2025-05-07T20:23:55.7322020Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:23:55.7322756Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_589e50b1-f869-4197-9b6e-dcb1911e9ee8 ...
2025-05-07T20:23:55.7324338Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
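[EXAMPLE] The add_path_* file referenced above is GitHub Actions' "path" file command: any line appended to the file named by $GITHUB_PATH is prepended to PATH for all subsequent steps. A minimal sketch of what such an export typically looks like (the exact logic inside setup_env.bash is not shown in this log):

    # Make the Miniconda binaries visible to later workflow steps
    echo "$HOME/miniconda/bin" >> "$GITHUB_PATH"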
2025-05-07T20:23:55.7404194Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.9
2025-05-07T20:23:55.7404688Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.9
2025-05-07T20:23:55.7421226Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:55.7421568Z env:
2025-05-07T20:23:55.7421779Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:55.7422096Z   BUILD_ENV: build_binary
2025-05-07T20:23:55.7422347Z   BUILD_TARGET: genai
2025-05-07T20:23:55.7422578Z   BUILD_VARIANT: cuda
2025-05-07T20:23:55.7422805Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:55.7423061Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:55.7423364Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:55.7423687Z ##[endgroup]
2025-05-07T20:23:56.0804215Z ################################################################################
2025-05-07T20:23:56.0804604Z # Create Conda Environment
2025-05-07T20:23:56.0804844Z #
2025-05-07T20:23:56.0820809Z # [2025-05-07T20:23:56.081Z] + create_conda_environment build_binary 3.9
2025-05-07T20:23:56.0821281Z ################################################################################
2025-05-07T20:23:56.0821542Z 
2025-05-07T20:23:56.0837693Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:56.1753756Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:56.1754195Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:23:56.1754561Z + conda info --envs
2025-05-07T20:23:56.1754766Z 
2025-05-07T20:23:56.9546703Z # conda environments:
2025-05-07T20:23:56.9546989Z #
2025-05-07T20:23:56.9547217Z base                   /home/ec2-user/miniconda
2025-05-07T20:23:56.9547445Z 
2025-05-07T20:23:57.0195148Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:23:58.6480869Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:23:58.6481150Z 
2025-05-07T20:23:58.6505879Z [SETUP] Creating new Conda environment (Python 3.9) ...
2025-05-07T20:23:58.6528496Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.9
2025-05-07T20:23:59.4306612Z Channels:
2025-05-07T20:23:59.4306849Z  - defaults
2025-05-07T20:23:59.4307059Z Platform: linux-64
2025-05-07T20:24:00.8902186Z Collecting package metadata (repodata.json): ...working... done
2025-05-07T20:24:00.9907899Z Solving environment: ...working... done
2025-05-07T20:24:01.0201146Z 
2025-05-07T20:24:01.0201613Z ## Package Plan ##
2025-05-07T20:24:01.0201807Z 
2025-05-07T20:24:01.0202046Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:01.0202419Z 
2025-05-07T20:24:01.0202530Z   added / updated specs:
2025-05-07T20:24:01.0202820Z     - python=3.9
2025-05-07T20:24:01.0202962Z 
2025-05-07T20:24:01.0203088Z The following packages will be downloaded:
2025-05-07T20:24:01.0203304Z 
2025-05-07T20:24:01.0203458Z     package                    |            build
2025-05-07T20:24:01.0204032Z     ---------------------------|-----------------
2025-05-07T20:24:01.0204397Z     _libgcc_mutex-0.1          |             main           3 KB
2025-05-07T20:24:01.0204798Z     _openmp_mutex-5.1          |            1_gnu          21 KB
2025-05-07T20:24:01.0205339Z     ca-certificates-2025.2.25  |       h06a4308_0         129 KB
2025-05-07T20:24:01.0206259Z     python-3.9.21              |       he870216_1        25.1 MB
2025-05-07T20:24:01.0206657Z     setuptools-78.1.1          |   py39h06a4308_0         1.7 MB
2025-05-07T20:24:01.0207056Z     wheel-0.45.1               |   py39h06a4308_0         114 KB
2025-05-07T20:24:01.0207412Z     ------------------------------------------------------------
2025-05-07T20:24:01.0207746Z                                            Total:        27.1 MB
2025-05-07T20:24:01.0207952Z 
2025-05-07T20:24:01.0208085Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:01.0208303Z 
2025-05-07T20:24:01.0208737Z   _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:01.0209180Z   _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:01.0209690Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:01.0210231Z   ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:01.0210683Z   libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:01.0211120Z   libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.0211554Z   libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:01.0212014Z   libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.0212606Z   ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:01.0213186Z   openssl            pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:01.0213595Z   pip                pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:01.0214004Z   python             pkgs/main/linux-64::python-3.9.21-he870216_1
2025-05-07T20:24:01.0214424Z   readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:01.0214891Z   setuptools         pkgs/main/linux-64::setuptools-78.1.1-py39h06a4308_0
2025-05-07T20:24:01.0215351Z   sqlite             pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:01.0215747Z   tk                 pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:01.0216121Z   tzdata             pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:01.0216532Z   wheel              pkgs/main/linux-64::wheel-0.45.1-py39h06a4308_0
2025-05-07T20:24:01.0216921Z   xz                 pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:01.0217283Z   zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:01.0217536Z 
2025-05-07T20:24:01.0217691Z Downloading and Extracting Packages: ...working...
[progress-bar redraw output elided: python-3.9.21 (25.1 MB), setuptools-78.1.1, ca-certificates-2025.2.25, wheel-0.45.1, _openmp_mutex-5.1, and _libgcc_mutex-0.1 each downloaded to 100%]
2025-05-07T20:24:02.0337091Z done
2025-05-07T20:24:02.2442660Z Preparing transaction: ...working... done
2025-05-07T20:24:03.3854561Z Verifying transaction: ...working... done
2025-05-07T20:24:05.6031351Z Executing transaction: ...working... done
2025-05-07T20:24:05.6528335Z #
2025-05-07T20:24:05.6528572Z # To activate this environment, use
2025-05-07T20:24:05.6528871Z #
2025-05-07T20:24:05.6529081Z #     $ conda activate build_binary
2025-05-07T20:24:05.6529348Z #
2025-05-07T20:24:05.6529554Z # To deactivate an active environment, use
2025-05-07T20:24:05.6530120Z #
2025-05-07T20:24:05.6530311Z #     $ conda deactivate
2025-05-07T20:24:05.6530465Z 
2025-05-07T20:24:05.7565187Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:05.7586964Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:08.5698528Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (25.1)
2025-05-07T20:24:08.5699144Z Collecting pip
2025-05-07T20:24:08.5699468Z   Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:08.5699890Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:08.5701108Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 102.3 MB/s eta 0:00:00
2025-05-07T20:24:08.5701493Z Installing collected packages: pip
2025-05-07T20:24:08.5701793Z   Attempting uninstall: pip
2025-05-07T20:24:08.5702075Z     Found existing installation: pip 25.1
2025-05-07T20:24:08.5702386Z     Uninstalling pip-25.1:
2025-05-07T20:24:08.5702688Z       Successfully uninstalled pip-25.1
2025-05-07T20:24:08.5703006Z Successfully installed pip-25.1.1
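[EXAMPLE] Stripped of retries and logging, the environment bootstrap in this step reduces to two commands, taken verbatim from the log above:

    # Create the build environment with Python 3.9, then upgrade pip inside it
    conda create -y -n build_binary python=3.9
    conda run -n build_binary pip install --upgrade pip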
2025-05-07T20:24:08.6343055Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:08.6366710Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:09.5160748Z Channels:
2025-05-07T20:24:09.5160999Z  - conda-forge
2025-05-07T20:24:09.5161223Z Platform: linux-64
2025-05-07T20:24:19.9941195Z Collecting package metadata (repodata.json): ...working... done
2025-05-07T20:24:21.5047694Z Solving environment: ...working... done
2025-05-07T20:24:21.5702603Z 
2025-05-07T20:24:21.5703171Z ## Package Plan ##
2025-05-07T20:24:21.5703603Z 
2025-05-07T20:24:21.5704411Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:21.5705011Z 
2025-05-07T20:24:21.5705198Z   added / updated specs:
2025-05-07T20:24:21.5705708Z     - pyopenssl[version='>22.1.0']
2025-05-07T20:24:21.5706103Z 
2025-05-07T20:24:21.5706343Z The following packages will be downloaded:
2025-05-07T20:24:21.5706765Z 
2025-05-07T20:24:21.5706984Z     package                    |            build
2025-05-07T20:24:21.5707433Z     ---------------------------|-----------------
2025-05-07T20:24:21.5707806Z     cffi-1.17.1                |   py39h15c3d72_0         236 KB  conda-forge
2025-05-07T20:24:21.5708244Z     cryptography-44.0.3        |   py39h7170ec2_0         1.5 MB  conda-forge
2025-05-07T20:24:21.5708689Z     libgcc-15.1.0              |       h767d61c_2         810 KB  conda-forge
2025-05-07T20:24:21.5709112Z     libgcc-ng-15.1.0           |       h69a702a_2          34 KB  conda-forge
2025-05-07T20:24:21.5709523Z     libgomp-15.1.0             |       h767d61c_2         442 KB  conda-forge
2025-05-07T20:24:21.5710006Z     openssl-3.5.0              |       h7b32b05_1         3.0 MB  conda-forge
2025-05-07T20:24:21.5710421Z     pycparser-2.22             |     pyh29332c3_1         108 KB  conda-forge
2025-05-07T20:24:21.5710858Z     pyopenssl-25.0.0           |     pyhd8ed1ab_0         120 KB  conda-forge
2025-05-07T20:24:21.5711277Z     python_abi-3.9             |           2_cp39           4 KB  conda-forge
2025-05-07T20:24:21.5711723Z     typing-extensions-4.13.2   |       h0e9735f_0          88 KB  conda-forge
2025-05-07T20:24:21.5712306Z     typing_extensions-4.13.2   |     pyh29332c3_0          51 KB  conda-forge
2025-05-07T20:24:21.5712868Z     ------------------------------------------------------------
2025-05-07T20:24:21.5713222Z                                            Total:         6.3 MB
2025-05-07T20:24:21.5713438Z 
2025-05-07T20:24:21.5713566Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:21.5713816Z 
2025-05-07T20:24:21.5714084Z   cffi               conda-forge/linux-64::cffi-1.17.1-py39h15c3d72_0
2025-05-07T20:24:21.5714771Z   cryptography       conda-forge/linux-64::cryptography-44.0.3-py39h7170ec2_0
2025-05-07T20:24:21.5715672Z   libgcc             conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:21.5716125Z   pycparser          conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:21.5716603Z   pyopenssl          conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:21.5717054Z   python_abi         conda-forge/linux-64::python_abi-3.9-2_cp39
2025-05-07T20:24:21.5717565Z   typing-extensions  conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:21.5718196Z   typing_extensions  conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:21.5718530Z 
2025-05-07T20:24:21.5718822Z The following packages will be UPDATED:
2025-05-07T20:24:21.5719027Z 
2025-05-07T20:24:21.5719665Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:21.5720515Z   libgcc-ng          pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:21.5721166Z   libgomp            pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:21.5721792Z   openssl            pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:21.5722149Z 
2025-05-07T20:24:21.5722309Z Downloading and Extracting Packages: ...working...
[progress-bar redraw output elided: all 11 packages downloaded to 100%]
2025-05-07T20:24:22.0288824Z done
2025-05-07T20:24:22.1292614Z Preparing transaction: ...working... done
2025-05-07T20:24:22.2298895Z Verifying transaction: ...working... done
2025-05-07T20:24:23.7321962Z Executing transaction: ...working... done
2025-05-07T20:24:23.9018952Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:25.6213877Z [CHECK] Python (sub-)package 'OpenSSL' found ...
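[EXAMPLE] The import test verifies that the upgraded pyOpenSSL is actually importable inside the target environment (pyOpenSSL installs as the module 'OpenSSL'). A minimal sketch of an equivalent check; the real helper in setup_env.bash is not shown in this log:

    # Exits non-zero if the OpenSSL module cannot be imported
    conda run -n build_binary python -c "import OpenSSL; print(OpenSSL.__version__)"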
2025-05-07T20:24:25.6227078Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:25.6250195Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:26.4935096Z Channels:
2025-05-07T20:24:26.4935400Z  - conda-forge
2025-05-07T20:24:26.4935695Z Platform: linux-64
2025-05-07T20:24:29.7664462Z Collecting package metadata (repodata.json): ...working... done
2025-05-07T20:24:30.1334518Z Solving environment: ...working... done
2025-05-07T20:24:30.1941824Z 
2025-05-07T20:24:30.1942229Z ## Package Plan ##
2025-05-07T20:24:30.1942451Z 
2025-05-07T20:24:30.1942736Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:30.1943150Z 
2025-05-07T20:24:30.1943308Z   added / updated specs:
2025-05-07T20:24:30.1943595Z     - libxcrypt
2025-05-07T20:24:30.1943724Z 
2025-05-07T20:24:30.1943854Z The following packages will be downloaded:
2025-05-07T20:24:30.1944069Z 
2025-05-07T20:24:30.1944192Z     package                    |            build
2025-05-07T20:24:30.1944508Z     ---------------------------|-----------------
2025-05-07T20:24:30.1944883Z     libxcrypt-4.4.36           |       hd590300_1          98 KB  conda-forge
2025-05-07T20:24:30.1945285Z     ------------------------------------------------------------
2025-05-07T20:24:30.1945629Z                                            Total:          98 KB
2025-05-07T20:24:30.1945839Z 
2025-05-07T20:24:30.1945964Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:30.1946180Z 
2025-05-07T20:24:30.1946405Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:30.1946687Z 
2025-05-07T20:24:30.1946845Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:30.3649164Z libxcrypt-4.4.36     | 98 KB     | ########## | 100%
2025-05-07T20:24:30.3652470Z done
2025-05-07T20:24:30.4655805Z Preparing transaction: ...working... done
2025-05-07T20:24:30.5662084Z Verifying transaction: ...working... done
2025-05-07T20:24:30.6669502Z Executing transaction: ...working... done
2025-05-07T20:24:34.0883840Z [SETUP] Copying over ...
2025-05-07T20:24:34.0884529Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.9/crypt.h
2025-05-07T20:24:34.0885064Z 
2025-05-07T20:24:35.7221700Z [SETUP] Installed Python version: Python 3.9.21
2025-05-07T20:24:35.7222156Z [SETUP] Successfully created Conda environment: build_binary
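[EXAMPLE] This step appears to work around CPython 3.9 headers that still reference crypt.h, which newer glibc-based hosts (glibc 2.34 here, with crypt split out into the separate libxcrypt project) may no longer ship by default: conda-forge's libxcrypt supplies the header, and the job copies it into the environment's Python include directory. A condensed sketch of the same workaround, with the prefix path taken from the log; the rationale above is an editorial inference, not stated in the log itself:

    PREFIX=/home/ec2-user/miniconda/envs/build_binary
    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    # Place crypt.h where the Python 3.9 headers expect to find it
    cp "$PREFIX/include/crypt.h" "$PREFIX/include/python3.9/crypt.h"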
2025-05-07T20:24:35.7256194Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:35.7256697Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:35.7270600Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:35.7270952Z env:
2025-05-07T20:24:35.7271190Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:35.7271515Z   BUILD_ENV: build_binary
2025-05-07T20:24:35.7271757Z   BUILD_TARGET: genai
2025-05-07T20:24:35.7271987Z   BUILD_VARIANT: cuda
2025-05-07T20:24:35.7272226Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:24:35.7272473Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:35.7272774Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:35.7273109Z ##[endgroup]
2025-05-07T20:24:36.0621957Z ################################################################################
2025-05-07T20:24:36.0622706Z # Install C/C++ Compilers
2025-05-07T20:24:36.0622951Z #
2025-05-07T20:24:36.0646661Z # [2025-05-07T20:24:36.063Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:36.0647130Z ################################################################################
2025-05-07T20:24:36.0647427Z 
2025-05-07T20:24:36.0654221Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:36.1543762Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:36.1552502Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:24:36.1574976Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:37.0254470Z Channels:
2025-05-07T20:24:37.0255127Z  - conda-forge
2025-05-07T20:24:37.0255605Z Platform: linux-64
2025-05-07T20:24:40.3381534Z Collecting package metadata (repodata.json): ...working... done
2025-05-07T20:24:40.7048061Z Solving environment: ...working... done
2025-05-07T20:24:40.7659280Z 
2025-05-07T20:24:40.7659734Z ## Package Plan ##
2025-05-07T20:24:40.7660026Z 
2025-05-07T20:24:40.7660477Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:40.7661072Z 
2025-05-07T20:24:40.7661253Z   added / updated specs:
2025-05-07T20:24:40.7661665Z     - sysroot_linux-64=2.17
2025-05-07T20:24:40.7661855Z 
2025-05-07T20:24:40.7662010Z The following packages will be downloaded:
2025-05-07T20:24:40.7662223Z 
2025-05-07T20:24:40.7662335Z     package                        |            build
2025-05-07T20:24:40.7662649Z     -------------------------------|-----------------
2025-05-07T20:24:40.7663064Z     kernel-headers_linux-64-3.10.0 |      he073ed8_18         921 KB  conda-forge
2025-05-07T20:24:40.7663536Z     sysroot_linux-64-2.17          |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:24:40.7663945Z     ------------------------------------------------------------
2025-05-07T20:24:40.7664281Z                                            Total:        15.4 MB
2025-05-07T20:24:40.7664495Z 
2025-05-07T20:24:40.7664635Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:40.7664853Z 
2025-05-07T20:24:40.7665131Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:40.7665683Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:40.7665989Z 
2025-05-07T20:24:40.7666135Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:40.7666509Z sysroot_linux-64-2.1 | 14.5 MB | | 0% 2025-05-07T20:24:40.7666728Z 2025-05-07T20:24:40.9750215Z kernel-headers_linux | 921 KB | | 0%  2025-05-07T20:24:40.9967854Z sysroot_linux-64-2.1 | 14.5 MB | | 0% 2025-05-07T20:24:40.9968212Z 2025-05-07T20:24:41.0124063Z kernel-headers_linux | 921 KB | 1 | 2%  2025-05-07T20:24:41.0124322Z 2025-05-07T20:24:41.0750732Z kernel-headers_linux | 921 KB | ########## | 100%  2025-05-07T20:24:41.1114597Z sysroot_linux-64-2.1 | 14.5 MB | #########3 | 93% 2025-05-07T20:24:41.2756071Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100% 2025-05-07T20:24:41.2756320Z 2025-05-07T20:24:41.2757118Z kernel-headers_linux | 921 KB | ########## | 100%  2025-05-07T20:24:41.2757552Z 2025-05-07T20:24:41.7138586Z kernel-headers_linux | 921 KB | ########## | 100%  2025-05-07T20:24:41.7142704Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100% 2025-05-07T20:24:41.7143229Z 2025-05-07T20:24:41.7143514Z 2025-05-07T20:24:41.7143798Z  done 2025-05-07T20:24:41.8149050Z Preparing transaction: / done 2025-05-07T20:24:42.0156405Z Verifying transaction: \ | done 2025-05-07T20:24:42.2231414Z Executing transaction: - \ done 2025-05-07T20:24:42.3753437Z [CHECK] LD_LIBRARY_PATH = 2025-05-07T20:24:42.3753729Z [CHECK] CONDA_PREFIX is not set. 2025-05-07T20:24:44.0769907Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6 2025-05-07T20:24:44.0783124Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ... 2025-05-07T20:24:44.0807032Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0 2025-05-07T20:24:44.9718367Z Channels: 2025-05-07T20:24:44.9718593Z - conda-forge 2025-05-07T20:24:44.9718820Z Platform: linux-64 2025-05-07T20:24:48.2311424Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:24:49.1870806Z Solving environment: \ | / done 2025-05-07T20:24:49.2506786Z 2025-05-07T20:24:49.2506947Z ## Package Plan ## 2025-05-07T20:24:49.2507098Z 2025-05-07T20:24:49.2507314Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:49.2507610Z 2025-05-07T20:24:49.2507715Z added / updated specs: 2025-05-07T20:24:49.2507969Z - gxx_linux-64=11.4.0 2025-05-07T20:24:49.2508163Z 2025-05-07T20:24:49.2508167Z 2025-05-07T20:24:49.2508283Z The following packages will be downloaded: 2025-05-07T20:24:49.2508504Z 2025-05-07T20:24:49.2508624Z package | build 2025-05-07T20:24:49.2508956Z ---------------------------|----------------- 2025-05-07T20:24:49.2509349Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge 2025-05-07T20:24:49.2509915Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge 2025-05-07T20:24:49.2510372Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge 2025-05-07T20:24:49.2510801Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge 2025-05-07T20:24:49.2511235Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge 2025-05-07T20:24:49.2511664Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge 2025-05-07T20:24:49.2512085Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge 2025-05-07T20:24:49.2512548Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge 2025-05-07T20:24:49.2513012Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge 2025-05-07T20:24:49.2513442Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge 2025-05-07T20:24:49.2513934Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge 2025-05-07T20:24:49.2514415Z libstdcxx-ng-15.1.0 | h4852527_2 
34 KB conda-forge 2025-05-07T20:24:49.2514809Z ------------------------------------------------------------ 2025-05-07T20:24:49.2515141Z Total: 91.6 MB 2025-05-07T20:24:49.2515343Z 2025-05-07T20:24:49.2515472Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:49.2515691Z 2025-05-07T20:24:49.2515954Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 2025-05-07T20:24:49.2516506Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4 2025-05-07T20:24:49.2517389Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13 2025-05-07T20:24:49.2517889Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4 2025-05-07T20:24:49.2518385Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13 2025-05-07T20:24:49.2518874Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4 2025-05-07T20:24:49.2519388Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:49.2519925Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13 2025-05-07T20:24:49.2520406Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T20:24:49.2520933Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:49.2521286Z 2025-05-07T20:24:49.2521544Z The following packages will be UPDATED: 2025-05-07T20:24:49.2521744Z 2025-05-07T20:24:49.2522054Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7 2025-05-07T20:24:49.2522755Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2 2025-05-07T20:24:49.2523157Z 2025-05-07T20:24:49.2523161Z 2025-05-07T20:24:49.2523165Z 2025-05-07T20:24:49.2523304Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:24:49.2523304Z Downloading and Extracting Packages: ...working...
[progress-bar output elided: gcc_impl_linux-64 (53.0 MB), gxx_impl_linux-64 (11.2 MB), libstdcxx-devel_linux-64 (11.1 MB), binutils_impl_linux-64 (6.0 MB), libstdcxx (3.7 MB), libsanitizer (3.5 MB), libgcc-devel_linux-64 (2.3 MB), ld_impl_linux-64 (691 KB), libstdcxx-ng (34 KB), gcc_linux-64 (31 KB), gxx_linux-64 (29 KB), binutils_linux-64 (28 KB) -- all 12 downloads reached 100%]
2025-05-07T20:24:51.6356709Z done
2025-05-07T20:24:51.7358331Z Preparing transaction: done
2025-05-07T20:24:52.0366707Z Verifying transaction: done
2025-05-07T20:24:52.1377103Z Executing transaction: done
2025-05-07T20:24:52.2991185Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:56.1819109Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:56.1850955Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:56.1880209Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:56.1910809Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:58.0841490Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:58.1473154Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:00.0280380Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:00.0903544Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:01.9653904Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:02.0268756Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:03.9045444Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:03.9678623Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:03.9682406Z [INFO] Printing out all preprocessor defines in the C compiler ...
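The dump that follows is straightforward to reproduce outside the CI job: -dM asks the preprocessor to emit every macro it defines, -E stops after preprocessing, and the trailing "-" reads the (empty) translation unit from stdin. A minimal sketch, assuming a local build_binary environment like the one set up above:

    # Dump all predefined C macros, then pick out one of interest;
    # __GNUC__ should print 11 for the gcc 11.4.0 toolchain installed above.
    echo "" | conda run -n build_binary cc -dM -E -
    echo "" | conda run -n build_binary cc -dM -E - | grep -w __GNUC__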
2025-05-07T20:25:03.9683135Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:03.9683372Z 2025-05-07T20:25:05.8486761Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:05.8487343Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:05.8488030Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:05.8489048Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:05.8489623Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:05.8490310Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:05.8490774Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:05.8491403Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:05.8491927Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:05.8492322Z #define __CHAR_BIT__ 8 2025-05-07T20:25:05.8492773Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:05.8493316Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:05.8494145Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:05.8494494Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:05.8494933Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:05.8495347Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8495685Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:05.8496117Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:05.8496548Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:05.8496906Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:05.8497450Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:05.8497963Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:05.8498387Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:05.8498732Z #define __GCC_IEC_559 2 2025-05-07T20:25:05.8499084Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:05.8499477Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:05.8499811Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:05.8500207Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:05.8500659Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8501043Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:05.8501418Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.8501817Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:05.8502207Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:05.8502525Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:05.8502910Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:05.8503294Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:05.8503608Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:05.8504590Z #define __INT8_C(c) c 2025-05-07T20:25:05.8504979Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:05.8505345Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8505796Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:05.8506240Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:05.8506681Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:05.8507066Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:05.8507489Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8507882Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:05.8508281Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:05.8508786Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:05.8509291Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:05.8509797Z #define __linux 1 2025-05-07T20:25:05.8510142Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:05.8510517Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:05.8510968Z #define __unix 1 2025-05-07T20:25:05.8511251Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:05.8511646Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:05.8512054Z #define __WINT_MIN__ 0U 2025-05-07T20:25:05.8512354Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.8512748Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:05.8513161Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:05.8513478Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:05.8514064Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:05.8514498Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:05.8514848Z #define __INT64_C(c) c ## L 2025-05-07T20:25:05.8515223Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:05.8515652Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:05.8516024Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:05.8516427Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:05.8516944Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:05.8517306Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:05.8517619Z #define __DBL_DIG__ 15 2025-05-07T20:25:05.8517985Z #define __FLT32_DIG__ 6 2025-05-07T20:25:05.8518401Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:05.8518805Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:05.8519198Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:05.8519773Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:05.8520188Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:05.8520591Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:05.8520941Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:05.8521391Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:05.8521929Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:05.8522290Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:05.8522619Z #define __unix__ 1 2025-05-07T20:25:05.8523041Z #define __INT_WIDTH__ 32 2025-05-07T20:25:05.8523335Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:05.8523679Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:05.8524094Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:05.8524415Z #define __UINT16_C(c) c 2025-05-07T20:25:05.8524745Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:05.8525150Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:05.8525559Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:05.8526025Z #define __gnu_linux__ 1 2025-05-07T20:25:05.8526421Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:05.8526756Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.8527145Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8527430Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:05.8527721Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:05.8527977Z #define __GNUC__ 11 2025-05-07T20:25:05.8528188Z #define __pie__ 2 2025-05-07T20:25:05.8528405Z #define __MMX__ 1 2025-05-07T20:25:05.8528628Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:05.8528887Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:05.8529168Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:05.8529436Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:05.8529782Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:05.8530175Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8530492Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:05.8530761Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:05.8531019Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:05.8531325Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:05.8531671Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:05.8531930Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:05.8532212Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:05.8532501Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:05.8532765Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:05.8543569Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:05.8543842Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:05.8544103Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:05.8544369Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:05.8544624Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:05.8544873Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:05.8545186Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:05.8545539Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:05.8545817Z #define __SSE2_MATH__ 1 2025-05-07T20:25:05.8546055Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:05.8546569Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8546857Z #define __amd64 1 2025-05-07T20:25:05.8547072Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:05.8547334Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:05.8547633Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:05.8547934Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:05.8548184Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:05.8548455Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:05.8548699Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:05.8548961Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:05.8549222Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:05.8549481Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:05.8549881Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:05.8550174Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:05.8550417Z #define __x86_64 1 2025-05-07T20:25:05.8550758Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:05.8551136Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:05.8551612Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:05.8552062Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:05.8552538Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:05.8552933Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:05.8553184Z #define __LP64__ 1 2025-05-07T20:25:05.8553426Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8553773Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:05.8554151Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:05.8554426Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:05.8554698Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.8554978Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:05.8555254Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:05.8555522Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:05.8555783Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:05.8556046Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:05.8556302Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:05.8556636Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:05.8556999Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:05.8557276Z #define __FLT_DIG__ 6 2025-05-07T20:25:05.8557503Z #define __NO_INLINE__ 1 2025-05-07T20:25:05.8557746Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:05.8558072Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:05.8558415Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:05.8558671Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:05.8558933Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:05.8559184Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:05.8559444Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:05.8559705Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:05.8559998Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:05.8560286Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:05.8560566Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:05.8560868Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:05.8561198Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:05.8561464Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:05.8561726Z #define __FLT128_DIG__ 33 2025-05-07T20:25:05.8561960Z #define __INT32_C(c) c 2025-05-07T20:25:05.8562205Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:05.8562487Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:05.8562765Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:05.8563047Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:05.8563364Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:05.8563667Z #define unix 1 2025-05-07T20:25:05.8563898Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:05.8564211Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8564516Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:05.8564829Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:05.8565271Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:05.8565522Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:05.8565785Z #define __ELF__ 1 2025-05-07T20:25:05.8566017Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:05.8566293Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:05.8566569Z #define __FLT_RADIX__ 2 2025-05-07T20:25:05.8566817Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:05.8567179Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:05.8567537Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:05.8567792Z #define __SSE_MATH__ 1 2025-05-07T20:25:05.8568020Z #define __k8 1 2025-05-07T20:25:05.8568312Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:05.8568687Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:05.8568986Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:05.8569369Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:05.8569628Z #define __LDBL_DIG__ 18 2025-05-07T20:25:05.8569881Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:05.8570135Z #define __x86_64__ 1 2025-05-07T20:25:05.8570375Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:05.8570677Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:05.8571018Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8571320Z #define __FLT64_DIG__ 15 2025-05-07T20:25:05.8571605Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8571955Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:05.8572263Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8572532Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:05.8572810Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8573100Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:05.8573470Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:05.8573870Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:05.8574159Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:05.8574500Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:05.8574824Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:05.8575122Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:05.8575398Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:05.8575709Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:05.8575998Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:05.8576231Z #define __SEG_FS 1 2025-05-07T20:25:05.8576464Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:05.8576742Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:05.8577013Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8577304Z #define __SEG_GS 1 2025-05-07T20:25:05.8577617Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:05.8577991Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:05.8578268Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:05.8578558Z #define __INT16_TYPE__ short int 2025-05-07T20:25:05.8578836Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:05.8579128Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:05.8579394Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:05.8579640Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:05.8579894Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:05.8580238Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:05.8580622Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8580904Z #define linux 1 2025-05-07T20:25:05.8581132Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8581409Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:05.8581676Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:05.8581928Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:05.8582190Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:05.8582446Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:05.8582790Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:05.8583204Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:05.8583634Z #define __code_model_small__ 1 2025-05-07T20:25:05.8583904Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:05.8584190Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:05.8584435Z #define __k8__ 1 2025-05-07T20:25:05.8584657Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:05.8584945Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:05.8585242Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:05.8585482Z #define __pic__ 2 2025-05-07T20:25:05.8585735Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8586046Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:05.8586333Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8586663Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:05.8587033Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:05.8587532Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:05.8587837Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:05.8588139Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:05.8588452Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:05.8588699Z #define __linux__ 1 2025-05-07T20:25:05.8588929Z #define __INT64_TYPE__ long int 2025-05-07T20:25:05.8589197Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:05.8589452Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:05.8589869Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:05.8590133Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:05.8590424Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8590753Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:05.8591054Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:05.8591319Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:05.8591612Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:05.8591911Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:05.8592238Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:05.8592603Z #define __SSE__ 1 2025-05-07T20:25:05.8592835Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:05.8593181Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:05.8593518Z #define __amd64__ 1 2025-05-07T20:25:05.8593743Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:05.8593993Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:05.8594256Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:05.8594528Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:05.8594799Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:05.8595064Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:05.8595325Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:05.8595601Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:05.8595863Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:05.8596221Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:05.8596687Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:05.8597046Z #define _LP64 1 2025-05-07T20:25:05.8597256Z #define __UINT8_C(c) c 2025-05-07T20:25:05.8597515Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:05.8597815Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:05.8598079Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:05.8598355Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:05.8598659Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:05.8599015Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:05.8599481Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:05.8599857Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8600145Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8600463Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:05.8600830Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:05.8601197Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:05.8601466Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:05.8601807Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:05.8602289Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:05.8602549Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:05.8602806Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:05.8603058Z #define __FXSR__ 1 2025-05-07T20:25:05.8603357Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:05.8604257Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:05.8604680Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:05.8605002Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:05.8605257Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:05.8605589Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:05.8605941Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:05.8606188Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:05.8606431Z #define __PIC__ 2 2025-05-07T20:25:05.8606906Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:05.8607306Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:05.8607689Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:05.8608060Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:05.8608397Z #define __SSE2__ 1 2025-05-07T20:25:05.8608615Z #define __INT32_TYPE__ int 2025-05-07T20:25:05.8608852Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:05.8609103Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:05.8609434Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:05.8609774Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:05.8610045Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:05.8610311Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:05.8610580Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8610847Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:05.8611091Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:05.8611337Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:05.8611614Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8611911Z #define __PIE__ 2 2025-05-07T20:25:05.8612227Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:05.8612605Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:05.8612949Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:05.8613307Z #define __INT16_C(c) c 2025-05-07T20:25:05.8613520Z #define __STDC__ 1 2025-05-07T20:25:05.8613748Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:05.8614016Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:05.8614270Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.8614561Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:05.8614905Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:05.8615231Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:05.8615488Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.8615767Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:05.8616029Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:05.8616306Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:05.8616597Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8616868Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:05.8617159Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8617555Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:05.8617929Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:05.8618229Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:05.8618514Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:05.8618762Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:05.8618919Z 2025-05-07T20:25:05.9122212Z 2025-05-07T20:25:05.9122802Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
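The same trick works for the C++ front end; the extra "-x c++" in the command below forces C++ mode, which is why C++-only macros such as __cplusplus and the __cpp_* feature-test macros appear in this second dump. For instance, the default language standard can be checked with (again a sketch, assuming a local build_binary env):

    # __cplusplus prints 201703L in this log, i.e. the toolchain defaults to C++17.
    echo "" | conda run -n build_binary c++ -dM -E -x c++ - | grep -w __cplusplus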
2025-05-07T20:25:05.9123425Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:05.9123733Z 2025-05-07T20:25:07.8030701Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:07.8031099Z #define __cpp_attributes 200809L 2025-05-07T20:25:07.8031736Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:07.8032188Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:07.8032568Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:07.8032828Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:07.8033159Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:07.8033506Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:07.8033776Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:07.8034084Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:07.8034389Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:07.8034650Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:07.8034893Z #define __CHAR_BIT__ 8 2025-05-07T20:25:07.8035129Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:07.8035375Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:07.8035619Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:07.8035889Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:07.8036328Z #define __cpp_static_assert 201411L 2025-05-07T20:25:07.8036612Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:07.8036910Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8037209Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:07.8037490Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:07.8037816Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:07.8038184Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:07.8038572Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:07.8038980Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:07.8039287Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:07.8039565Z #define __GCC_IEC_559 2 2025-05-07T20:25:07.8039802Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:07.8040077Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:07.8040352Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:07.8040637Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:07.8040928Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:07.8041426Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:07.8041730Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:07.8042061Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8042382Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:07.8042646Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.8042926Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:07.8043204Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:07.8043499Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:07.8043756Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:07.8044013Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:07.8044284Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:07.8044605Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:07.8044929Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:07.8045178Z #define __INT8_C(c) c 2025-05-07T20:25:07.8045412Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:07.8045681Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:07.8046003Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8046317Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:07.8046586Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:07.8046873Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:07.8047178Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.8047528Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:07.8047807Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:07.8048083Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.8048339Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8048613Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:07.8048885Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:07.8049266Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:07.8049673Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:07.8050112Z #define __linux 1 2025-05-07T20:25:07.8050334Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:07.8050711Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:07.8050993Z #define __unix 1 2025-05-07T20:25:07.8051212Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:07.8051491Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:07.8051775Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:07.8052041Z #define __WINT_MIN__ 0U 2025-05-07T20:25:07.8052277Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.8052557Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:07.8052832Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:07.8053094Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:07.8053345Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:07.8053624Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:07.8053912Z #define __INT64_C(c) c ## L 2025-05-07T20:25:07.8054179Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:07.8054557Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:07.8054823Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:07.8055130Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:07.8055404Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:07.8055659Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:07.8056007Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:07.8056379Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:07.8056634Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:07.8056901Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:07.8057176Z #define __DBL_DIG__ 15 2025-05-07T20:25:07.8057410Z #define __FLT32_DIG__ 6 2025-05-07T20:25:07.8057706Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:07.8058046Z #define __GXX_WEAK__ 1 2025-05-07T20:25:07.8058278Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:07.8058591Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:07.8058912Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:07.8059261Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.8059515Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:07.8059813Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:07.8060138Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:07.8060539Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:07.8060923Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:07.8061199Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:07.8061451Z #define __unix__ 1 2025-05-07T20:25:07.8061666Z #define __INT_WIDTH__ 32 2025-05-07T20:25:07.8061907Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:07.8062151Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:07.8062394Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:07.8062657Z #define __UINT16_C(c) c 2025-05-07T20:25:07.8062895Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:07.8063139Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:07.8063489Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:07.8063851Z #define __gnu_linux__ 1 2025-05-07T20:25:07.8064092Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:07.8064345Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:07.8064638Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.8073141Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8073430Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:07.8073690Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:07.8073945Z #define __GNUC__ 11 2025-05-07T20:25:07.8074163Z #define __GXX_RTTI 1 2025-05-07T20:25:07.8074386Z #define __pie__ 2 2025-05-07T20:25:07.8074601Z #define __MMX__ 1 2025-05-07T20:25:07.8074829Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:07.8075095Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:07.8075381Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:07.8075650Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:07.8075899Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:07.8076204Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:07.8076537Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:07.8077012Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.8077385Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:07.8077690Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8078007Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.8078314Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:07.8078592Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:07.8078892Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:07.8079187Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:07.8079459Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:07.8079719Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:07.8079996Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:07.8080288Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:07.8080558Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:07.8080830Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:07.8081196Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:07.8081456Z #define __cplusplus 201703L 2025-05-07T20:25:07.8081729Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:07.8082011Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:07.8082257Z #define __DEPRECATED 1 2025-05-07T20:25:07.8082509Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:07.8082801Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:07.8083050Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:07.8083366Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.8083722Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:07.8083985Z #define __SSE2_MATH__ 1 2025-05-07T20:25:07.8084229Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:07.8084529Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8084814Z #define __amd64 1 2025-05-07T20:25:07.8085035Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:07.8085304Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:07.8085565Z #define __GNUG__ 11 2025-05-07T20:25:07.8085823Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:07.8086131Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:07.8086387Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:07.8086640Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:07.8086916Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:07.8087169Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:07.8087437Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:07.8087729Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:07.8087991Z #define __cpp_hex_float 201603L 2025-05-07T20:25:07.8088251Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:07.8088513Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:07.8088789Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:07.8089055Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:07.8089321Z #define __x86_64 1 2025-05-07T20:25:07.8089550Z #define __cpp_lambdas 200907L 2025-05-07T20:25:07.8089813Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:07.8090186Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:07.8090574Z #define __cpp_template_auto 201606L 2025-05-07T20:25:07.8090933Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:07.8091371Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:07.8091836Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:07.8092218Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:07.8092460Z #define __LP64__ 1 2025-05-07T20:25:07.8092688Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8093033Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:07.8093401Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:07.8093671Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.8093954Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:07.8094219Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:07.8094484Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:07.8094748Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:07.8095009Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.8095442Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:07.8095799Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:07.8096071Z #define __FLT_DIG__ 6 2025-05-07T20:25:07.8096294Z #define __NO_INLINE__ 1 2025-05-07T20:25:07.8096532Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:07.8096855Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:07.8097191Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:07.8097443Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:07.8097703Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:07.8097976Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:07.8098276Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:07.8098569Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:07.8098816Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:07.8099107Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:07.8099464Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:07.8099726Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:07.8100030Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:07.8100366Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:07.8100650Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:07.8100910Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:07.8101164Z #define __FLT128_DIG__ 33 2025-05-07T20:25:07.8101400Z #define __INT32_C(c) c 2025-05-07T20:25:07.8101634Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:07.8101913Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:07.8102191Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:07.8102465Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:07.8102776Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:07.8103079Z #define unix 1 2025-05-07T20:25:07.8103292Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:07.8103552Z #define __cpp_rtti 199711L 2025-05-07T20:25:07.8104197Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:07.8104512Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8104811Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:07.8105110Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:07.8105428Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:07.8105668Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:07.8105947Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:07.8106223Z #define __ELF__ 1 2025-05-07T20:25:07.8106444Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:07.8106716Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:07.8106979Z #define __FLT_RADIX__ 2 2025-05-07T20:25:07.8107209Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:07.8107559Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:07.8107914Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:07.8108178Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:07.8108483Z #define __k8 1 2025-05-07T20:25:07.8108783Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:07.8109152Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:07.8109431Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:07.8109775Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:07.8110028Z #define __LDBL_DIG__ 18 2025-05-07T20:25:07.8110256Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:07.8110506Z #define __x86_64__ 1 2025-05-07T20:25:07.8110736Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:07.8111019Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:07.8111349Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8111645Z #define __FLT64_DIG__ 15 2025-05-07T20:25:07.8111913Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8112253Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:07.8112557Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8112815Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:07.8113082Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8113370Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:07.8113873Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:07.8114257Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:07.8114539Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:07.8114852Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:07.8115154Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:07.8115471Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:07.8115761Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:07.8116029Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:07.8116326Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:07.8116599Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:07.8116831Z #define __SEG_FS 1 2025-05-07T20:25:07.8117049Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:07.8117321Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:07.8117589Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8117987Z #define __SEG_GS 1 2025-05-07T20:25:07.8118297Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:07.8118666Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:07.8118930Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:07.8119211Z #define __INT16_TYPE__ short int 2025-05-07T20:25:07.8119479Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:07.8119771Z #define __cpp_structured_bindings 201606L 2025-05-07T20:25:07.8120062Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:07.8120302Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:07.8120548Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:07.8120881Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:07.8121253Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8121557Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:25:07.8121867Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:25:07.8122161Z #define linux 1 2025-05-07T20:25:07.8122381Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8122643Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:07.8122915Z #define __EXCEPTIONS 1 2025-05-07T20:25:07.8123156Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:07.8123404Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:07.8123666Z #define __cpp_range_based_for 201603L 2025-05-07T20:25:07.8123946Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:07.8124278Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:07.8124657Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:25:07.8124997Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:07.8125317Z #define __code_model_small__ 1 2025-05-07T20:25:07.8125577Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:07.8125872Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:25:07.8126168Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:07.8126435Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:25:07.8126723Z #define __k8__ 1 2025-05-07T20:25:07.8126943Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:07.8127216Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:07.8127509Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:07.8127797Z #define __pic__ 2 2025-05-07T20:25:07.8128091Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8128471Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:07.8128794Z #define __cpp_decltype 200707L 2025-05-07T20:25:07.8129148Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8129548Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:07.8130003Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:07.8130403Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:07.8130692Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:07.8131003Z #define __cpp_inline_variables 201606L 2025-05-07T20:25:07.8131292Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:07.8131541Z #define __linux__ 1 2025-05-07T20:25:07.8131769Z #define __INT64_TYPE__ long int 2025-05-07T20:25:07.8132026Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:07.8132367Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:07.8132631Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:25:07.8132899Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:25:07.8133207Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:07.8133501Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8133800Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:07.8134062Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:07.8134350Z #define __UINT_LEAST32_TYPE__ unsigned 
int 2025-05-07T20:25:07.8134639Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:07.8134954Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.8135301Z #define __SSE__ 1 2025-05-07T20:25:07.8135523Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:07.8135852Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:07.8136266Z #define __amd64__ 1 2025-05-07T20:25:07.8136490Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:07.8136732Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:07.8137004Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:07.8137263Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:07.8137520Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:07.8137805Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:07.8138132Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:07.8138453Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:07.8138874Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:07.8139434Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:07.8139859Z #define _LP64 1 2025-05-07T20:25:07.8140067Z #define __UINT8_C(c) c 2025-05-07T20:25:07.8140301Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:07.8140569Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:07.8140825Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:07.8141090Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:07.8141446Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:07.8141896Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:07.8142260Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8142547Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8142848Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:07.8143142Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:25:07.8143513Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:07.8143872Z #define __STDCPP_THREADS__ 1 2025-05-07T20:25:07.8144128Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:07.8144385Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:07.8144719Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:07.8145073Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:07.8145329Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:07.8145577Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:07.8145817Z #define __FXSR__ 1 2025-05-07T20:25:07.8146118Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:07.8146561Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.8146959Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:07.8147250Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:07.8147508Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:25:07.8147802Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:07.8148081Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:07.8148342Z #define __cpp_alias_templates 200704L 2025-05-07T20:25:07.8148692Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:07.8149043Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:07.8149303Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:07.8149542Z #define __LONG_WIDTH__ 64 2025-05-07T20:25:07.8149823Z #define __PIC__ 2 2025-05-07T20:25:07.8150072Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:07.8150566Z #define 
__FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:07.8150942Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:07.8151260Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:07.8151597Z #define __cpp_constexpr 201603L 2025-05-07T20:25:07.8151849Z #define __SSE2__ 1 2025-05-07T20:25:07.8152067Z #define __cpp_deduction_guides 201703L 2025-05-07T20:25:07.8152342Z #define __INT32_TYPE__ int 2025-05-07T20:25:07.8152581Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:07.8152826Z #define __cpp_exceptions 199711L 2025-05-07T20:25:07.8153092Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:07.8153417Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:07.8153755Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:07.8154015Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:07.8154275Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:07.8154661Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8154925Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:07.8155171Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:07.8155414Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:25:07.8155691Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:07.8155972Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8156256Z #define __PIE__ 2 2025-05-07T20:25:07.8156562Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:07.8156961Z #define __cpp_template_template_args 201611L 2025-05-07T20:25:07.8157261Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:07.8157588Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:07.8157962Z #define __INT16_C(c) c 2025-05-07T20:25:07.8158206Z #define __STDC__ 1 2025-05-07T20:25:07.8158410Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:07.8158656Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:07.8158926Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:07.8159167Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.8159459Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:07.8159794Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:07.8160117Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:07.8160367Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.8160645Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:25:07.8160911Z #define __SSE_MATH__ 1 2025-05-07T20:25:07.8161133Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:07.8161405Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:25:07.8161703Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:07.8161967Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:07.8162249Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8162512Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:07.8162792Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8163174Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.8163537Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:07.8163830Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:07.8164103Z #define _GNU_SOURCE 1 2025-05-07T20:25:07.8164339Z #define __cpp_init_captures 201304L 2025-05-07T20:25:07.8164605Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:07.8164837Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:07.8164994Z 2025-05-07T20:25:07.8650497Z 2025-05-07T20:25:07.8650828Z + conda run -n build_binary c++ --version 2025-05-07T20:25:07.8651102Z 2025-05-07T20:25:09.7510390Z c++ 
(conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:09.7510916Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:09.7511499Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:09.7512027Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:09.7512360Z 2025-05-07T20:25:09.7512364Z 2025-05-07T20:25:09.8126422Z 2025-05-07T20:25:09.8126884Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:09.8127904Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:09.8128224Z 2025-05-07T20:25:11.7575784Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:11.7578088Z 2025-05-07T20:25:11.7578665Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:11.7579249Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:11.7579578Z 2025-05-07T20:25:13.6972637Z #define __cplusplus 201703L 2025-05-07T20:25:13.6975271Z 2025-05-07T20:25:13.6976771Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:13.7022838Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:13.7023263Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:13.7035485Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:13.7035988Z env: 2025-05-07T20:25:13.7036212Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:13.7036508Z BUILD_ENV: build_binary 2025-05-07T20:25:13.7036754Z BUILD_TARGET: genai 2025-05-07T20:25:13.7036980Z BUILD_VARIANT: cuda 2025-05-07T20:25:13.7037210Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:25:13.7037456Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:13.7037754Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:13.7038086Z ##[endgroup] 2025-05-07T20:25:14.0357694Z ################################################################################ 2025-05-07T20:25:14.0358065Z # Install CUDA 2025-05-07T20:25:14.0358267Z # 2025-05-07T20:25:14.0374259Z # [2025-05-07T20:25:14.037Z] + install_cuda build_binary 12.6.3 2025-05-07T20:25:14.0374649Z ################################################################################ 2025-05-07T20:25:14.0374867Z 2025-05-07T20:25:14.0389523Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:14.1340916Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:14.1341853Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:14.1345248Z + conda clean --packages --tarball -y 2025-05-07T20:25:14.1345744Z 2025-05-07T20:25:14.8414031Z Will remove 32 (140.4 MB) tarball(s). 2025-05-07T20:25:14.8414366Z Will remove 6 (617 KB) package(s). 2025-05-07T20:25:14.9033599Z 2025-05-07T20:25:14.9043763Z + conda clean --all -y 2025-05-07T20:25:14.9044013Z 2025-05-07T20:25:15.5754008Z There are no unused tarball(s) to remove. 2025-05-07T20:25:15.5754344Z Will remove 1 index cache(s). 2025-05-07T20:25:15.5754638Z There are no unused package(s) to remove. 2025-05-07T20:25:15.5754946Z There are no tempfile(s) to remove. 2025-05-07T20:25:15.5755262Z There are no logfile(s) to remove. 2025-05-07T20:25:15.6370551Z 2025-05-07T20:25:15.6384825Z [INSTALL] Installing CUDA 12.6.3 ... 
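The "[EXEC] [ATTEMPT 0/3]" prefix on the network probe above (and on the conda install that follows) comes from a retry wrapper sourced from $PRELUDE (.github/scripts/setup_env.bash). A minimal sketch of such a wrapper, assuming a hypothetical name and a fixed three-attempt policy rather than the script's actual helper:

  # Sketch only: exec_with_retries is a hypothetical name; the real helper
  # lives in .github/scripts/setup_env.bash and may differ in detail.
  exec_with_retries () {
    local max_attempts=3 attempt
    for attempt in $(seq 0 $((max_attempts - 1))); do
      echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
      "$@" && return 0          # stop on first success
      sleep 2                   # brief pause before retrying
    done
    return 1                    # all attempts failed
  }

  # Usage, mirroring the network probe logged above:
  exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null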
2025-05-07T20:25:15.6408361Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3 2025-05-07T20:25:16.5506490Z Channels: 2025-05-07T20:25:16.5506732Z - conda-forge 2025-05-07T20:25:16.5506960Z Platform: linux-64 2025-05-07T20:25:27.0679357Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | done 2025-05-07T20:25:28.1693859Z Solving environment: - \ | / done 2025-05-07T20:25:28.2426888Z 2025-05-07T20:25:28.2427697Z ## Package Plan ## 2025-05-07T20:25:28.2428124Z 2025-05-07T20:25:28.2428528Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:25:28.2429123Z 2025-05-07T20:25:28.2429310Z added / updated specs: 2025-05-07T20:25:28.2430095Z - cuda=12.6.3 2025-05-07T20:25:28.2430362Z 2025-05-07T20:25:28.2430405Z 2025-05-07T20:25:28.2430652Z The following packages will be downloaded: 2025-05-07T20:25:28.2431075Z 2025-05-07T20:25:28.2431291Z package | build 2025-05-07T20:25:28.2431910Z ---------------------------|----------------- 2025-05-07T20:25:28.2432650Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:25:28.2433057Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:25:28.2433459Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:25:28.2433861Z bzip2-1.0.8 | h4bc722e_7 247 KB conda-forge 2025-05-07T20:25:28.2434267Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:25:28.2434664Z cuda-12.6.3 | ha804496_0 26 KB conda-forge 2025-05-07T20:25:28.2435501Z cuda-cccl_linux-64-12.6.77 | ha770c72_0 1.0 MB conda-forge 2025-05-07T20:25:28.2435999Z cuda-command-line-tools-12.6.3| ha770c72_0 20 KB conda-forge 2025-05-07T20:25:28.2436479Z cuda-compiler-12.6.3 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:25:28.2436948Z cuda-crt-dev_linux-64-12.6.85| ha770c72_0 87 KB conda-forge 2025-05-07T20:25:28.2437575Z cuda-crt-tools-12.6.85 | ha770c72_0 26 KB conda-forge 2025-05-07T20:25:28.2438025Z cuda-cudart-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.2438484Z cuda-cudart-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.2438974Z cuda-cudart-dev_linux-64-12.6.77| h3f2d84a_0 357 KB conda-forge 2025-05-07T20:25:28.2439472Z cuda-cudart-static-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.2439980Z cuda-cudart-static_linux-64-12.6.77| h3f2d84a_0 744 KB conda-forge 2025-05-07T20:25:28.2440485Z cuda-cudart_linux-64-12.6.77| h3f2d84a_0 184 KB conda-forge 2025-05-07T20:25:28.2440955Z cuda-cuobjdump-12.6.77 | hbd13f7d_1 241 KB conda-forge 2025-05-07T20:25:28.2441405Z cuda-cupti-12.6.80 | hbd13f7d_0 1.9 MB conda-forge 2025-05-07T20:25:28.2441849Z cuda-cupti-dev-12.6.80 | h5888daf_0 3.4 MB conda-forge 2025-05-07T20:25:28.2442304Z cuda-cuxxfilt-12.6.77 | hbd13f7d_1 211 KB conda-forge 2025-05-07T20:25:28.2442758Z cuda-driver-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.2443242Z cuda-driver-dev_linux-64-12.6.77| h3f2d84a_0 35 KB conda-forge 2025-05-07T20:25:28.2443696Z cuda-gdb-12.6.77 | h50b4baa_1 370 KB conda-forge 2025-05-07T20:25:28.2444135Z cuda-libraries-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:28.2444608Z cuda-libraries-dev-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:28.2445061Z cuda-nsight-12.6.77 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:25:28.2445486Z cuda-nvcc-12.6.85 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:25:28.2445937Z cuda-nvcc-dev_linux-64-12.6.85| he91c749_0 10.8 MB conda-forge 2025-05-07T20:25:28.2446405Z cuda-nvcc-impl-12.6.85 | h85509e4_0 25 KB conda-forge 
2025-05-07T20:25:28.2446852Z cuda-nvcc-tools-12.6.85 | he02047a_0 23.0 MB conda-forge 2025-05-07T20:25:28.2447310Z cuda-nvcc_linux-64-12.6.85 | h04802cd_0 25 KB conda-forge 2025-05-07T20:25:28.2447764Z cuda-nvdisasm-12.6.77 | hbd13f7d_1 47.6 MB conda-forge 2025-05-07T20:25:28.2448210Z cuda-nvml-dev-12.6.77 | hbd13f7d_1 159 KB conda-forge 2025-05-07T20:25:28.2448644Z cuda-nvprof-12.6.80 | hbd13f7d_0 2.6 MB conda-forge 2025-05-07T20:25:28.2449092Z cuda-nvprune-12.6.77 | hbd13f7d_1 66 KB conda-forge 2025-05-07T20:25:28.2449526Z cuda-nvrtc-12.6.85 | hbd13f7d_0 17.3 MB conda-forge 2025-05-07T20:25:28.2449958Z cuda-nvrtc-dev-12.6.85 | h5888daf_0 31 KB conda-forge 2025-05-07T20:25:28.2450397Z cuda-nvtx-12.6.77 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:25:28.2450849Z cuda-nvvm-dev_linux-64-12.6.85| ha770c72_0 25 KB conda-forge 2025-05-07T20:25:28.2451309Z cuda-nvvm-impl-12.6.85 | he02047a_0 7.7 MB conda-forge 2025-05-07T20:25:28.2451756Z cuda-nvvm-tools-12.6.85 | he02047a_0 10.4 MB conda-forge 2025-05-07T20:25:28.2452193Z cuda-nvvp-12.6.80 | hbd13f7d_1 109.3 MB conda-forge 2025-05-07T20:25:28.2452616Z cuda-opencl-12.6.77 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:25:28.2453180Z cuda-opencl-dev-12.6.77 | h5888daf_0 93 KB conda-forge 2025-05-07T20:25:28.2453652Z cuda-profiler-api-12.6.77 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:25:28.2454113Z cuda-runtime-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:25:28.2454572Z cuda-sanitizer-api-12.6.77 | hbd13f7d_1 8.9 MB conda-forge 2025-05-07T20:25:28.2455107Z cuda-toolkit-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:25:28.2455539Z cuda-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:28.2455963Z cuda-version-12.6 | h7480c83_3 20 KB conda-forge 2025-05-07T20:25:28.2456411Z cuda-visual-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:28.2456864Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:25:28.2457271Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:25:28.2457663Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:25:28.2458117Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:25:28.2458634Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:25:28.2459150Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:25:28.2459643Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:25:28.2460077Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:25:28.2460538Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:25:28.2461007Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:25:28.2461435Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:25:28.2461826Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:28.2462228Z gds-tools-1.11.1.6 | h5888daf_4 37.8 MB conda-forge 2025-05-07T20:25:28.2462623Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:25:28.2462996Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:28.2463393Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:25:28.2463788Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:25:28.2464173Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:25:28.2464591Z libcublas-12.6.4.1 | h5888daf_1 256.2 MB conda-forge 2025-05-07T20:25:28.2465034Z libcublas-dev-12.6.4.1 | h5888daf_1 88 KB conda-forge 2025-05-07T20:25:28.2465473Z libcufft-11.3.0.4 | hbd13f7d_0 156.2 MB conda-forge 2025-05-07T20:25:28.2465905Z libcufft-dev-11.3.0.4 | h5888daf_0 33 KB conda-forge 
2025-05-07T20:25:28.2466340Z libcufile-1.11.1.6 | h12f29b5_4 900 KB conda-forge 2025-05-07T20:25:28.2466783Z libcufile-dev-1.11.1.6 | h5888daf_4 35 KB conda-forge 2025-05-07T20:25:28.2467225Z libcurand-10.3.7.77 | hbd13f7d_0 39.9 MB conda-forge 2025-05-07T20:25:28.2467668Z libcurand-dev-10.3.7.77 | h5888daf_0 262 KB conda-forge 2025-05-07T20:25:28.2468117Z libcusolver-11.7.1.2 | h5888daf_1 95.8 MB conda-forge 2025-05-07T20:25:28.2468573Z libcusolver-dev-11.7.1.2 | h5888daf_1 59 KB conda-forge 2025-05-07T20:25:28.2469027Z libcusparse-12.5.4.2 | hbd13f7d_0 118.6 MB conda-forge 2025-05-07T20:25:28.2469483Z libcusparse-dev-12.5.4.2 | h5888daf_0 51 KB conda-forge 2025-05-07T20:25:28.2470020Z libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge 2025-05-07T20:25:28.2470605Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:25:28.2471098Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:25:28.2471610Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:25:28.2472203Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:25:28.2472705Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:25:28.2473247Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:25:28.2473734Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:25:28.2474185Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:25:28.2474644Z libnpp-12.3.1.54 | h5888daf_0 93.4 MB conda-forge 2025-05-07T20:25:28.2475133Z libnpp-dev-12.3.1.54 | h5888daf_0 441 KB conda-forge 2025-05-07T20:25:28.2475612Z libnsl-2.0.1 | hd590300_0 33 KB conda-forge 2025-05-07T20:25:28.2476063Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:25:28.2476555Z libnvfatbin-12.6.77 | hbd13f7d_0 783 KB conda-forge 2025-05-07T20:25:28.2477097Z libnvfatbin-dev-12.6.77 | h5888daf_0 26 KB conda-forge 2025-05-07T20:25:28.2477630Z libnvjitlink-12.6.85 | hbd13f7d_0 14.9 MB conda-forge 2025-05-07T20:25:28.2478169Z libnvjitlink-dev-12.6.85 | h5888daf_0 25 KB conda-forge 2025-05-07T20:25:28.2478694Z libnvjpeg-12.3.3.54 | h5888daf_0 2.4 MB conda-forge 2025-05-07T20:25:28.2479211Z libnvjpeg-dev-12.3.3.54 | ha770c72_0 31 KB conda-forge 2025-05-07T20:25:28.2479696Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:25:28.2480168Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:25:28.2480662Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:25:28.2481146Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:25:28.2481621Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:25:28.2482078Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:25:28.2482561Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:25:28.2483065Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:25:28.2483590Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:25:28.2484050Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:25:28.2484494Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:25:28.2484993Z nsight-compute-2024.3.2.3 | hb5ebaad_0 443.1 MB conda-forge 2025-05-07T20:25:28.2485490Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:25:28.2485916Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:25:28.2486353Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:25:28.2486862Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:25:28.2487362Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:25:28.2487845Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB 
conda-forge 2025-05-07T20:25:28.2488344Z python-3.9.18 |h0755675_1_cpython 22.7 MB conda-forge 2025-05-07T20:25:28.2488827Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:25:28.2489387Z sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge 2025-05-07T20:25:28.2489832Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:25:28.2490283Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:25:28.2490788Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:25:28.2491215Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:25:28.2491658Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:25:28.2492109Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:25:28.2492579Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:25:28.2493026Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:25:28.2493470Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:25:28.2493924Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:25:28.2494347Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:25:28.2494767Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:25:28.2495204Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:25:28.2495669Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:25:28.2496139Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:25:28.2496594Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:28.2497035Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:25:28.2497481Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:28.2497914Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:25:28.2498356Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:25:28.2498815Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:25:28.2499278Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:25:28.2499676Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:25:28.2500054Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:25:28.2500425Z ------------------------------------------------------------ 2025-05-07T20:25:28.2500758Z Total: 1.63 GB 2025-05-07T20:25:28.2500972Z 2025-05-07T20:25:28.2501098Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:28.2501319Z 2025-05-07T20:25:28.2501523Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:25:28.2501940Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:25:28.2502347Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:25:28.2502774Z bzip2 conda-forge/linux-64::bzip2-1.0.8-h4bc722e_7 2025-05-07T20:25:28.2503215Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:25:28.2503641Z cuda conda-forge/noarch::cuda-12.6.3-ha804496_0 2025-05-07T20:25:28.2504349Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.6.77-ha770c72_0 2025-05-07T20:25:28.2505310Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.6.3-ha770c72_0 2025-05-07T20:25:28.2505923Z cuda-compiler conda-forge/noarch::cuda-compiler-12.6.3-hbad6d8a_0 2025-05-07T20:25:28.2506463Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:28.2507006Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.6.85-ha770c72_0 2025-05-07T20:25:28.2507749Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.6.77-h5888daf_0 
2025-05-07T20:25:28.2508269Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.2508828Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.2511735Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.6.77-h5888daf_0 2025-05-07T20:25:28.2512354Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.2512955Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.2513506Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.2514020Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0 2025-05-07T20:25:28.2514523Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0 2025-05-07T20:25:28.2515064Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.2515590Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.2516161Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.2516693Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1 2025-05-07T20:25:28.2517179Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0 2025-05-07T20:25:28.2517730Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0 2025-05-07T20:25:28.2518267Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0 2025-05-07T20:25:28.2518740Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0 2025-05-07T20:25:28.2519258Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0 2025-05-07T20:25:28.2519802Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0 2025-05-07T20:25:28.2520342Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0 2025-05-07T20:25:28.2520895Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0 2025-05-07T20:25:28.2521429Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.2521943Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.2522442Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0 2025-05-07T20:25:28.2522944Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.2523440Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0 2025-05-07T20:25:28.2523932Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0 2025-05-07T20:25:28.2524423Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0 2025-05-07T20:25:28.2524942Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:28.2525496Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0 2025-05-07T20:25:28.2526027Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0 2025-05-07T20:25:28.2526532Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1 2025-05-07T20:25:28.2527010Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0 2025-05-07T20:25:28.2527521Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.2528085Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0 2025-05-07T20:25:28.2528617Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0 2025-05-07T20:25:28.2529158Z 
cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.2529697Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0 2025-05-07T20:25:28.2530281Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0 2025-05-07T20:25:28.2530751Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3 2025-05-07T20:25:28.2531272Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0 2025-05-07T20:25:28.2531879Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:25:28.2532324Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:25:28.2532728Z expat conda-forge/linux-64::expat-2.7.0-h5888daf_0 2025-05-07T20:25:28.2533275Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:25:28.2533865Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:25:28.2534455Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:25:28.2535018Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:25:28.2535522Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:25:28.2536008Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:25:28.2536494Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:25:28.2536960Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:25:28.2537369Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:25:28.2537786Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4 2025-05-07T20:25:28.2538203Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:25:28.2538579Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:25:28.2538977Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:25:28.2539388Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:25:28.2539791Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:25:28.2540230Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1 2025-05-07T20:25:28.2540721Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1 2025-05-07T20:25:28.2541211Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0 2025-05-07T20:25:28.2541700Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0 2025-05-07T20:25:28.2542183Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4 2025-05-07T20:25:28.2542731Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4 2025-05-07T20:25:28.2543412Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0 2025-05-07T20:25:28.2543917Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0 2025-05-07T20:25:28.2544435Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1 2025-05-07T20:25:28.2544966Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1 2025-05-07T20:25:28.2545547Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0 2025-05-07T20:25:28.2546081Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0 2025-05-07T20:25:28.2546720Z libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2 2025-05-07T20:25:28.2547184Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:25:28.2547657Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:25:28.2548162Z libfreetype6 
conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:25:28.2548667Z libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:25:28.2549147Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:25:28.2549606Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:25:28.2550285Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:25:28.2550702Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:25:28.2551122Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0 2025-05-07T20:25:28.2551655Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0 2025-05-07T20:25:28.2552101Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0 2025-05-07T20:25:28.2552513Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:25:28.2552973Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0 2025-05-07T20:25:28.2553495Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.2554021Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0 2025-05-07T20:25:28.2554554Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0 2025-05-07T20:25:28.2555077Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0 2025-05-07T20:25:28.2555577Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0 2025-05-07T20:25:28.2556053Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:25:28.2556644Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:25:28.2557220Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:25:28.2557673Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:25:28.2558098Z libuuid conda-forge/linux-64::libuuid-2.38.1-h0b41bf4_0 2025-05-07T20:25:28.2558517Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:25:28.2558972Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:25:28.2559452Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:25:28.2559894Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:25:28.2560314Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:28.2560722Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:25:28.2561199Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0 2025-05-07T20:25:28.2561676Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:25:28.2562049Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:25:28.2562443Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:25:28.2562928Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:25:28.2563409Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:25:28.2563870Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:25:28.2564358Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:25:28.2564844Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:25:28.2574908Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:25:28.2575459Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:25:28.2576003Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:25:28.2576548Z xcb-util-keysyms 
conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:25:28.2577220Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:25:28.2577852Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0 2025-05-07T20:25:28.2578357Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0 2025-05-07T20:25:28.2578878Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T20:25:28.2579528Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T20:25:28.2580086Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T20:25:28.2580555Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T20:25:28.2581101Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2 2025-05-07T20:25:28.2581769Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0 2025-05-07T20:25:28.2582369Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T20:25:28.2582874Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T20:25:28.2583388Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T20:25:28.2583892Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T20:25:28.2584378Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T20:25:28.2584920Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T20:25:28.2585488Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T20:25:28.2585935Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2 2025-05-07T20:25:28.2586180Z 2025-05-07T20:25:28.2586295Z The following packages will be UPDATED: 2025-05-07T20:25:28.2586503Z 2025-05-07T20:25:28.2586740Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:28.2587072Z 2025-05-07T20:25:28.2587286Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T20:25:28.2587594Z 2025-05-07T20:25:28.2587876Z python pkgs/main::python-3.9.21-he870216_1 --> conda-forge::python-3.9.18-h0755675_1_cpython 2025-05-07T20:25:28.2588544Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1 2025-05-07T20:25:28.2589190Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T20:25:28.2589516Z 2025-05-07T20:25:28.2589664Z Downloading and Extracting Packages: ...working...
[... interleaved download progress-bar redraws elided: nsight-compute (443.1 MB), libcublas (256.2 MB), libcufft (156.2 MB), libcusparse (118.6 MB), cuda-nsight (113.2 MB), cuda-nvvp (109.3 MB), libcusolver (95.8 MB), libnpp (93.4 MB), and the remaining packages; downloads were still in progress at this point in the log ...]
| 443.1 MB | #8 | 19% 2025-05-07T20:25:30.9238055Z 2025-05-07T20:25:30.9238059Z 2025-05-07T20:25:30.9238063Z 2025-05-07T20:25:30.9238067Z 2025-05-07T20:25:30.9250941Z cuda-nsight-12.6.77 | 113.2 MB | #######8 | 79%  2025-05-07T20:25:30.9251284Z 2025-05-07T20:25:30.9251288Z 2025-05-07T20:25:30.9251292Z 2025-05-07T20:25:30.9368366Z libcusparse-12.5.4.2 | 118.6 MB | #######4 | 74%  2025-05-07T20:25:30.9369023Z 2025-05-07T20:25:30.9534186Z libcublas-12.6.4.1 | 256.2 MB | ###2 | 33%  2025-05-07T20:25:30.9534506Z 2025-05-07T20:25:30.9534510Z 2025-05-07T20:25:30.9853504Z libcufft-11.3.0.4 | 156.2 MB | ######3 | 64%  2025-05-07T20:25:31.0362300Z nsight-compute-2024. | 443.1 MB | #9 | 20% 2025-05-07T20:25:31.0362573Z 2025-05-07T20:25:31.0362579Z 2025-05-07T20:25:31.0364067Z 2025-05-07T20:25:31.0366269Z libcusparse-12.5.4.2 | 118.6 MB | #######7 | 77%  2025-05-07T20:25:31.0366551Z 2025-05-07T20:25:31.0366556Z 2025-05-07T20:25:31.0366560Z 2025-05-07T20:25:31.0366563Z 2025-05-07T20:25:31.0371758Z cuda-nsight-12.6.77 | 113.2 MB | ########1 | 82%  2025-05-07T20:25:31.0374108Z 2025-05-07T20:25:31.0534474Z libcublas-12.6.4.1 | 256.2 MB | ###4 | 34%  2025-05-07T20:25:31.0534742Z 2025-05-07T20:25:31.0534747Z 2025-05-07T20:25:31.0854130Z libcufft-11.3.0.4 | 156.2 MB | ######6 | 66%  2025-05-07T20:25:31.1388180Z nsight-compute-2024. | 443.1 MB | ## | 20% 2025-05-07T20:25:31.1388459Z 2025-05-07T20:25:31.1407675Z libcublas-12.6.4.1 | 256.2 MB | ###5 | 36%  2025-05-07T20:25:31.1407938Z 2025-05-07T20:25:31.1407942Z 2025-05-07T20:25:31.1407946Z 2025-05-07T20:25:31.1464667Z libcusparse-12.5.4.2 | 118.6 MB | #######9 | 80%  2025-05-07T20:25:31.1464950Z 2025-05-07T20:25:31.1464954Z 2025-05-07T20:25:31.1464958Z 2025-05-07T20:25:31.1464962Z 2025-05-07T20:25:31.1572133Z cuda-nsight-12.6.77 | 113.2 MB | ########4 | 85%  2025-05-07T20:25:31.1572493Z 2025-05-07T20:25:31.1572499Z 2025-05-07T20:25:31.1857325Z libcufft-11.3.0.4 | 156.2 MB | ######8 | 68%  2025-05-07T20:25:31.2393243Z nsight-compute-2024. | 443.1 MB | ##1 | 21% 2025-05-07T20:25:31.2393570Z 2025-05-07T20:25:31.2412652Z libcublas-12.6.4.1 | 256.2 MB | ###7 | 37%  2025-05-07T20:25:31.2412918Z 2025-05-07T20:25:31.2412923Z 2025-05-07T20:25:31.2414570Z 2025-05-07T20:25:31.2576224Z libcusparse-12.5.4.2 | 118.6 MB | ########3 | 83%  2025-05-07T20:25:31.2576518Z 2025-05-07T20:25:31.2576522Z 2025-05-07T20:25:31.2863361Z libcufft-11.3.0.4 | 156.2 MB | #######1 | 71%  2025-05-07T20:25:31.3097677Z nsight-compute-2024. | 443.1 MB | ##1 | 22% 2025-05-07T20:25:31.3098065Z 2025-05-07T20:25:31.3098071Z 2025-05-07T20:25:31.3098076Z 2025-05-07T20:25:31.3099847Z 2025-05-07T20:25:31.3447977Z cuda-nsight-12.6.77 | 113.2 MB | ########7 | 87%  2025-05-07T20:25:31.3448269Z 2025-05-07T20:25:31.3535355Z libcublas-12.6.4.1 | 256.2 MB | ###8 | 39%  2025-05-07T20:25:31.3535618Z 2025-05-07T20:25:31.3535622Z 2025-05-07T20:25:31.3538873Z 2025-05-07T20:25:31.3688087Z libcusparse-12.5.4.2 | 118.6 MB | ########6 | 86%  2025-05-07T20:25:31.3688448Z 2025-05-07T20:25:31.3691319Z 2025-05-07T20:25:31.3975106Z libcufft-11.3.0.4 | 156.2 MB | #######3 | 73%  2025-05-07T20:25:31.4100851Z nsight-compute-2024. 
| 443.1 MB | ##2 | 23% 2025-05-07T20:25:31.4101134Z 2025-05-07T20:25:31.4101138Z 2025-05-07T20:25:31.4101142Z 2025-05-07T20:25:31.4101784Z 2025-05-07T20:25:31.4467290Z cuda-nsight-12.6.77 | 113.2 MB | ######### | 90%  2025-05-07T20:25:31.4467676Z 2025-05-07T20:25:31.4563368Z libcublas-12.6.4.1 | 256.2 MB | #### | 40%  2025-05-07T20:25:31.4563632Z 2025-05-07T20:25:31.4563639Z 2025-05-07T20:25:31.4564896Z 2025-05-07T20:25:31.4730808Z libcusparse-12.5.4.2 | 118.6 MB | ########8 | 89%  2025-05-07T20:25:31.4731103Z 2025-05-07T20:25:31.4731107Z 2025-05-07T20:25:31.4976188Z libcufft-11.3.0.4 | 156.2 MB | #######5 | 76%  2025-05-07T20:25:31.5104124Z nsight-compute-2024. | 443.1 MB | ##3 | 24% 2025-05-07T20:25:31.5104465Z 2025-05-07T20:25:31.5104469Z 2025-05-07T20:25:31.5104473Z 2025-05-07T20:25:31.5105129Z 2025-05-07T20:25:31.5536793Z cuda-nsight-12.6.77 | 113.2 MB | #########3 | 93%  2025-05-07T20:25:31.5537094Z 2025-05-07T20:25:31.5590710Z libcublas-12.6.4.1 | 256.2 MB | ####1 | 41%  2025-05-07T20:25:31.5591047Z 2025-05-07T20:25:31.5591053Z 2025-05-07T20:25:31.5593249Z 2025-05-07T20:25:31.5844552Z libcusparse-12.5.4.2 | 118.6 MB | #########1 | 92%  2025-05-07T20:25:31.5844831Z 2025-05-07T20:25:31.5844835Z 2025-05-07T20:25:31.6103605Z libcufft-11.3.0.4 | 156.2 MB | #######7 | 78%  2025-05-07T20:25:31.6108615Z nsight-compute-2024. | 443.1 MB | ##4 | 24% 2025-05-07T20:25:31.6108927Z 2025-05-07T20:25:31.6108931Z 2025-05-07T20:25:31.6108935Z 2025-05-07T20:25:31.6110838Z 2025-05-07T20:25:31.6642321Z cuda-nsight-12.6.77 | 113.2 MB | #########5 | 96%  2025-05-07T20:25:31.6642704Z 2025-05-07T20:25:31.6692542Z libcublas-12.6.4.1 | 256.2 MB | ####2 | 43%  2025-05-07T20:25:31.6692866Z 2025-05-07T20:25:31.6692871Z 2025-05-07T20:25:31.6693622Z 2025-05-07T20:25:31.6859779Z libcusparse-12.5.4.2 | 118.6 MB | #########4 | 95%  2025-05-07T20:25:31.6860090Z 2025-05-07T20:25:31.6860096Z 2025-05-07T20:25:31.7119188Z libcufft-11.3.0.4 | 156.2 MB | ######## | 80%  2025-05-07T20:25:31.7119493Z 2025-05-07T20:25:31.7119499Z 2025-05-07T20:25:31.7119504Z 2025-05-07T20:25:31.7119509Z 2025-05-07T20:25:31.7157120Z cuda-nsight-12.6.77 | 113.2 MB | #########8 | 99%  2025-05-07T20:25:31.7683691Z nsight-compute-2024. | 443.1 MB | ##5 | 25% 2025-05-07T20:25:31.7683953Z 2025-05-07T20:25:31.7777283Z libcublas-12.6.4.1 | 256.2 MB | ####4 | 44%  2025-05-07T20:25:31.7777548Z 2025-05-07T20:25:31.7777552Z 2025-05-07T20:25:31.7777978Z 2025-05-07T20:25:31.7861391Z libcusparse-12.5.4.2 | 118.6 MB | #########7 | 97%  2025-05-07T20:25:31.7861689Z 2025-05-07T20:25:31.7863173Z 2025-05-07T20:25:31.8157631Z libcufft-11.3.0.4 | 156.2 MB | ########2 | 82%  2025-05-07T20:25:31.8688384Z nsight-compute-2024. | 443.1 MB | ##5 | 26% 2025-05-07T20:25:31.8689312Z 2025-05-07T20:25:31.8868248Z libcublas-12.6.4.1 | 256.2 MB | ####5 | 46%  2025-05-07T20:25:31.8868539Z 2025-05-07T20:25:31.8870212Z 2025-05-07T20:25:31.9157775Z libcufft-11.3.0.4 | 156.2 MB | ########5 | 85%  2025-05-07T20:25:31.9689788Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:25:31.9692117Z 2025-05-07T20:25:31.9870784Z libcublas-12.6.4.1 | 256.2 MB | ####7 | 47%  2025-05-07T20:25:31.9871114Z 2025-05-07T20:25:31.9871118Z 2025-05-07T20:25:32.0161897Z libcufft-11.3.0.4 | 156.2 MB | ########7 | 88%  2025-05-07T20:25:32.0697633Z nsight-compute-2024. 
| 443.1 MB | ##7 | 28% 2025-05-07T20:25:32.0701749Z 2025-05-07T20:25:32.0873260Z libcublas-12.6.4.1 | 256.2 MB | ####8 | 49%  2025-05-07T20:25:32.0873564Z 2025-05-07T20:25:32.0874299Z 2025-05-07T20:25:32.1163389Z libcufft-11.3.0.4 | 156.2 MB | ######### | 91%  2025-05-07T20:25:32.1699187Z nsight-compute-2024. | 443.1 MB | ##8 | 29% 2025-05-07T20:25:32.1699557Z 2025-05-07T20:25:32.1877057Z libcublas-12.6.4.1 | 256.2 MB | ##### | 50%  2025-05-07T20:25:32.1877329Z 2025-05-07T20:25:32.1877333Z 2025-05-07T20:25:32.2166229Z libcufft-11.3.0.4 | 156.2 MB | #########3 | 93%  2025-05-07T20:25:32.2703092Z nsight-compute-2024. | 443.1 MB | ##9 | 29% 2025-05-07T20:25:32.2705204Z 2025-05-07T20:25:32.2878593Z libcublas-12.6.4.1 | 256.2 MB | #####1 | 52%  2025-05-07T20:25:32.2878857Z 2025-05-07T20:25:32.2879525Z 2025-05-07T20:25:32.3169560Z libcufft-11.3.0.4 | 156.2 MB | #########6 | 96%  2025-05-07T20:25:32.3704326Z nsight-compute-2024. | 443.1 MB | ### | 30% 2025-05-07T20:25:32.3705398Z 2025-05-07T20:25:32.3881753Z libcublas-12.6.4.1 | 256.2 MB | #####3 | 53%  2025-05-07T20:25:32.3882087Z 2025-05-07T20:25:32.3882633Z 2025-05-07T20:25:32.4170962Z libcufft-11.3.0.4 | 156.2 MB | #########9 | 99%  2025-05-07T20:25:32.4705846Z nsight-compute-2024. | 443.1 MB | ###1 | 31% 2025-05-07T20:25:32.4706521Z 2025-05-07T20:25:32.5173988Z libcublas-12.6.4.1 | 256.2 MB | #####5 | 55%  2025-05-07T20:25:32.5711048Z nsight-compute-2024. | 443.1 MB | ###2 | 33% 2025-05-07T20:25:32.5711397Z 2025-05-07T20:25:32.6174559Z libcublas-12.6.4.1 | 256.2 MB | #####7 | 58%  2025-05-07T20:25:32.6710870Z nsight-compute-2024. | 443.1 MB | ###3 | 34% 2025-05-07T20:25:32.6711939Z 2025-05-07T20:25:32.7174531Z libcublas-12.6.4.1 | 256.2 MB | #####9 | 60%  2025-05-07T20:25:32.7945718Z nsight-compute-2024. | 443.1 MB | ###5 | 35% 2025-05-07T20:25:32.7946790Z 2025-05-07T20:25:32.8174322Z libcublas-12.6.4.1 | 256.2 MB | ######1 | 62%  2025-05-07T20:25:32.9043036Z nsight-compute-2024. | 443.1 MB | ###6 | 37% 2025-05-07T20:25:32.9043366Z 2025-05-07T20:25:32.9176628Z libcublas-12.6.4.1 | 256.2 MB | ######3 | 63%  2025-05-07T20:25:33.0043609Z nsight-compute-2024. | 443.1 MB | ###8 | 38% 2025-05-07T20:25:33.0044651Z 2025-05-07T20:25:33.0177845Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 66%  2025-05-07T20:25:33.1044700Z nsight-compute-2024. | 443.1 MB | ###9 | 39% 2025-05-07T20:25:33.1045075Z 2025-05-07T20:25:33.1180376Z libcublas-12.6.4.1 | 256.2 MB | ######7 | 68%  2025-05-07T20:25:33.2045226Z nsight-compute-2024. | 443.1 MB | #### | 41% 2025-05-07T20:25:33.2045580Z 2025-05-07T20:25:33.2226818Z libcublas-12.6.4.1 | 256.2 MB | ######9 | 69%  2025-05-07T20:25:33.3050547Z nsight-compute-2024. | 443.1 MB | ####2 | 42% 2025-05-07T20:25:33.3050843Z 2025-05-07T20:25:33.3338508Z libcublas-12.6.4.1 | 256.2 MB | #######1 | 72%  2025-05-07T20:25:33.4141210Z nsight-compute-2024. | 443.1 MB | ####3 | 43% 2025-05-07T20:25:33.4143509Z 2025-05-07T20:25:33.4484638Z libcublas-12.6.4.1 | 256.2 MB | #######3 | 73%  2025-05-07T20:25:33.5146046Z nsight-compute-2024. | 443.1 MB | ####4 | 45% 2025-05-07T20:25:33.5146553Z 2025-05-07T20:25:33.5511782Z libcublas-12.6.4.1 | 256.2 MB | #######5 | 75%  2025-05-07T20:25:33.6148256Z nsight-compute-2024. | 443.1 MB | ####5 | 46% 2025-05-07T20:25:33.6148632Z 2025-05-07T20:25:33.6566941Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 77%  2025-05-07T20:25:33.7563613Z nsight-compute-2024. 
| 443.1 MB | ####6 | 47% 2025-05-07T20:25:33.7565396Z 2025-05-07T20:25:33.7571941Z libcublas-12.6.4.1 | 256.2 MB | #######9 | 79%  2025-05-07T20:25:33.8582736Z nsight-compute-2024. | 443.1 MB | ####8 | 48% 2025-05-07T20:25:33.8670518Z nsight-compute-2024. | 443.1 MB | ####9 | 50% 2025-05-07T20:25:33.8670813Z 2025-05-07T20:25:33.9499804Z libcublas-12.6.4.1 | 256.2 MB | ######## | 81%  2025-05-07T20:25:33.9500068Z 2025-05-07T20:25:33.9500089Z 2025-05-07T20:25:33.9500093Z 2025-05-07T20:25:33.9501332Z 2025-05-07T20:25:33.9673268Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:33.9675267Z 2025-05-07T20:25:33.9765522Z libcublas-12.6.4.1 | 256.2 MB | ########2 | 83%  2025-05-07T20:25:33.9922400Z nsight-compute-2024. | 443.1 MB | #####1 | 51% 2025-05-07T20:25:33.9922752Z 2025-05-07T20:25:33.9922757Z 2025-05-07T20:25:33.9922763Z 2025-05-07T20:25:33.9922768Z 2025-05-07T20:25:33.9922774Z 2025-05-07T20:25:34.0710864Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:25:34.0711243Z 2025-05-07T20:25:34.0926521Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 84%  2025-05-07T20:25:34.0926831Z 2025-05-07T20:25:34.0926835Z 2025-05-07T20:25:34.0926839Z 2025-05-07T20:25:34.0926843Z 2025-05-07T20:25:34.0926961Z 2025-05-07T20:25:34.1149975Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 3%  2025-05-07T20:25:34.1921337Z nsight-compute-2024. | 443.1 MB | #####2 | 52% 2025-05-07T20:25:34.1921755Z 2025-05-07T20:25:34.1926625Z libcublas-12.6.4.1 | 256.2 MB | ########6 | 86%  2025-05-07T20:25:34.1926988Z 2025-05-07T20:25:34.1926993Z 2025-05-07T20:25:34.1926997Z 2025-05-07T20:25:34.1927199Z 2025-05-07T20:25:34.1929843Z 2025-05-07T20:25:34.2575061Z cuda-nvvp-12.6.80 | 109.3 MB | 5 | 6%  2025-05-07T20:25:34.2927326Z nsight-compute-2024. | 443.1 MB | #####3 | 53% 2025-05-07T20:25:34.2927701Z 2025-05-07T20:25:34.2927708Z 2025-05-07T20:25:34.2927713Z 2025-05-07T20:25:34.2927729Z 2025-05-07T20:25:34.2930568Z 2025-05-07T20:25:34.3113732Z cuda-nvvp-12.6.80 | 109.3 MB | 9 | 9%  2025-05-07T20:25:34.3114149Z 2025-05-07T20:25:34.3145671Z libcublas-12.6.4.1 | 256.2 MB | ########7 | 88%  2025-05-07T20:25:34.3145927Z 2025-05-07T20:25:34.3145931Z 2025-05-07T20:25:34.3145935Z 2025-05-07T20:25:34.3148654Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:34.3149009Z 2025-05-07T20:25:34.3149015Z 2025-05-07T20:25:34.3155304Z 2025-05-07T20:25:34.3668552Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:34.3668967Z 2025-05-07T20:25:34.3668973Z 2025-05-07T20:25:34.3668999Z 2025-05-07T20:25:34.3669002Z 2025-05-07T20:25:34.3669006Z 2025-05-07T20:25:34.3672615Z 2025-05-07T20:25:34.3929894Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:25:34.3930211Z 2025-05-07T20:25:34.3930217Z 2025-05-07T20:25:34.3930222Z 2025-05-07T20:25:34.3930227Z 2025-05-07T20:25:34.3930233Z 2025-05-07T20:25:34.3945412Z cuda-nvvp-12.6.80 | 109.3 MB | #2 | 12%  2025-05-07T20:25:34.4289798Z nsight-compute-2024. 
| 443.1 MB | #####4 | 54% 2025-05-07T20:25:34.4294286Z 2025-05-07T20:25:34.4669961Z libcublas-12.6.4.1 | 256.2 MB | ########9 | 89%  2025-05-07T20:25:34.4670247Z 2025-05-07T20:25:34.4670251Z 2025-05-07T20:25:34.4670255Z 2025-05-07T20:25:34.4670278Z 2025-05-07T20:25:34.4670282Z 2025-05-07T20:25:34.4672634Z 2025-05-07T20:25:34.4969656Z libcusolver-11.7.1.2 | 95.8 MB | 3 | 3%  2025-05-07T20:25:34.4970016Z 2025-05-07T20:25:34.4970022Z 2025-05-07T20:25:34.4970050Z 2025-05-07T20:25:34.4970055Z 2025-05-07T20:25:34.4972700Z 2025-05-07T20:25:34.5415022Z cuda-nvvp-12.6.80 | 109.3 MB | #5 | 15%  2025-05-07T20:25:34.5647056Z nsight-compute-2024. | 443.1 MB | #####5 | 55% 2025-05-07T20:25:34.5647461Z 2025-05-07T20:25:34.5672128Z libcublas-12.6.4.1 | 256.2 MB | ######### | 91%  2025-05-07T20:25:34.5672452Z 2025-05-07T20:25:34.5672468Z 2025-05-07T20:25:34.5672472Z 2025-05-07T20:25:34.5672476Z 2025-05-07T20:25:34.5672480Z 2025-05-07T20:25:34.5674259Z 2025-05-07T20:25:34.6048066Z libcusolver-11.7.1.2 | 95.8 MB | 6 | 6%  2025-05-07T20:25:34.6048493Z 2025-05-07T20:25:34.6048498Z 2025-05-07T20:25:34.6048501Z 2025-05-07T20:25:34.6048526Z 2025-05-07T20:25:34.6051353Z 2025-05-07T20:25:34.6673876Z cuda-nvvp-12.6.80 | 109.3 MB | #7 | 18%  2025-05-07T20:25:34.6674228Z 2025-05-07T20:25:34.6674232Z 2025-05-07T20:25:34.6674236Z 2025-05-07T20:25:34.6674240Z 2025-05-07T20:25:34.6674264Z 2025-05-07T20:25:34.6677074Z 2025-05-07T20:25:34.6887230Z libcusolver-11.7.1.2 | 95.8 MB | 8 | 9%  2025-05-07T20:25:34.6887701Z 2025-05-07T20:25:34.6889833Z libcublas-12.6.4.1 | 256.2 MB | #########1 | 92%  2025-05-07T20:25:34.7079208Z nsight-compute-2024. | 443.1 MB | #####6 | 56% 2025-05-07T20:25:34.7079466Z 2025-05-07T20:25:34.7079470Z 2025-05-07T20:25:34.7079474Z 2025-05-07T20:25:34.7079477Z 2025-05-07T20:25:34.7083028Z 2025-05-07T20:25:34.7678738Z cuda-nvvp-12.6.80 | 109.3 MB | ## | 21%  2025-05-07T20:25:34.7679065Z 2025-05-07T20:25:34.7679069Z 2025-05-07T20:25:34.7679073Z 2025-05-07T20:25:34.7679083Z 2025-05-07T20:25:34.7679087Z 2025-05-07T20:25:34.7681797Z 2025-05-07T20:25:34.8071676Z libcusolver-11.7.1.2 | 95.8 MB | #1 | 12%  2025-05-07T20:25:34.8072138Z 2025-05-07T20:25:34.8084299Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 93%  2025-05-07T20:25:34.8084652Z 2025-05-07T20:25:34.8084924Z 2025-05-07T20:25:34.8084930Z 2025-05-07T20:25:34.8084935Z 2025-05-07T20:25:34.8088197Z 2025-05-07T20:25:34.8118840Z cuda-nvvp-12.6.80 | 109.3 MB | ##3 | 23%  2025-05-07T20:25:34.8789259Z nsight-compute-2024. | 443.1 MB | #####6 | 57% 2025-05-07T20:25:34.8789608Z 2025-05-07T20:25:34.8789615Z 2025-05-07T20:25:34.8789620Z 2025-05-07T20:25:34.8789625Z 2025-05-07T20:25:34.8789631Z 2025-05-07T20:25:34.8789637Z 2025-05-07T20:25:34.9161347Z libcusolver-11.7.1.2 | 95.8 MB | #4 | 14%  2025-05-07T20:25:34.9166881Z 2025-05-07T20:25:34.9215329Z libcublas-12.6.4.1 | 256.2 MB | #########4 | 94%  2025-05-07T20:25:34.9215687Z 2025-05-07T20:25:34.9215706Z 2025-05-07T20:25:34.9215712Z 2025-05-07T20:25:34.9215718Z 2025-05-07T20:25:34.9215897Z 2025-05-07T20:25:34.9364716Z cuda-nvvp-12.6.80 | 109.3 MB | ##5 | 26%  2025-05-07T20:25:34.9793310Z nsight-compute-2024. 
| 443.1 MB | #####7 | 58% 2025-05-07T20:25:34.9793671Z 2025-05-07T20:25:34.9793676Z 2025-05-07T20:25:34.9793682Z 2025-05-07T20:25:34.9793687Z 2025-05-07T20:25:34.9793693Z 2025-05-07T20:25:34.9793703Z 2025-05-07T20:25:35.0381893Z libcusolver-11.7.1.2 | 95.8 MB | #7 | 17%  2025-05-07T20:25:35.0382300Z 2025-05-07T20:25:35.0382306Z 2025-05-07T20:25:35.0382312Z 2025-05-07T20:25:35.0382317Z 2025-05-07T20:25:35.0387177Z 2025-05-07T20:25:35.0432256Z cuda-nvvp-12.6.80 | 109.3 MB | ##8 | 29%  2025-05-07T20:25:35.0432638Z 2025-05-07T20:25:35.0765697Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 96%  2025-05-07T20:25:35.0818114Z nsight-compute-2024. | 443.1 MB | #####8 | 58% 2025-05-07T20:25:35.0818476Z 2025-05-07T20:25:35.0818483Z 2025-05-07T20:25:35.0818489Z 2025-05-07T20:25:35.0818494Z 2025-05-07T20:25:35.0818499Z 2025-05-07T20:25:35.0818504Z 2025-05-07T20:25:35.1468617Z libcusolver-11.7.1.2 | 95.8 MB | #9 | 20%  2025-05-07T20:25:35.1469024Z 2025-05-07T20:25:35.1469029Z 2025-05-07T20:25:35.1469034Z 2025-05-07T20:25:35.1469039Z 2025-05-07T20:25:35.1470637Z 2025-05-07T20:25:35.1615496Z cuda-nvvp-12.6.80 | 109.3 MB | ###1 | 31%  2025-05-07T20:25:35.1622770Z 2025-05-07T20:25:35.1869391Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 97%  2025-05-07T20:25:35.1869761Z 2025-05-07T20:25:35.1869767Z 2025-05-07T20:25:35.1869855Z 2025-05-07T20:25:35.1869861Z 2025-05-07T20:25:35.1869866Z 2025-05-07T20:25:35.1873567Z 2025-05-07T20:25:35.1946738Z libcusolver-11.7.1.2 | 95.8 MB | ##2 | 22%  2025-05-07T20:25:35.2469497Z nsight-compute-2024. | 443.1 MB | #####8 | 59% 2025-05-07T20:25:35.2469973Z 2025-05-07T20:25:35.2469999Z 2025-05-07T20:25:35.2470006Z 2025-05-07T20:25:35.2470011Z 2025-05-07T20:25:35.2472073Z 2025-05-07T20:25:35.2662279Z cuda-nvvp-12.6.80 | 109.3 MB | ###3 | 34%  2025-05-07T20:25:35.2664702Z 2025-05-07T20:25:35.2876500Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 98%  2025-05-07T20:25:35.2876867Z 2025-05-07T20:25:35.2876872Z 2025-05-07T20:25:35.2876875Z 2025-05-07T20:25:35.2876879Z 2025-05-07T20:25:35.2876883Z 2025-05-07T20:25:35.2878752Z 2025-05-07T20:25:35.3004667Z libcusolver-11.7.1.2 | 95.8 MB | ##5 | 25%  2025-05-07T20:25:35.3472167Z nsight-compute-2024. | 443.1 MB | #####9 | 60% 2025-05-07T20:25:35.3472453Z 2025-05-07T20:25:35.3472457Z 2025-05-07T20:25:35.3472461Z 2025-05-07T20:25:35.3472465Z 2025-05-07T20:25:35.3474458Z 2025-05-07T20:25:35.3667533Z cuda-nvvp-12.6.80 | 109.3 MB | ###6 | 37%  2025-05-07T20:25:35.3669499Z 2025-05-07T20:25:35.3877971Z libcublas-12.6.4.1 | 256.2 MB | #########8 | 99%  2025-05-07T20:25:35.3878331Z 2025-05-07T20:25:35.3878335Z 2025-05-07T20:25:35.3878345Z 2025-05-07T20:25:35.3878349Z 2025-05-07T20:25:35.3878352Z 2025-05-07T20:25:35.3883151Z 2025-05-07T20:25:35.4009542Z libcusolver-11.7.1.2 | 95.8 MB | ##7 | 28%  2025-05-07T20:25:35.4472920Z nsight-compute-2024. | 443.1 MB | ###### | 60% 2025-05-07T20:25:35.4473232Z 2025-05-07T20:25:35.4473236Z 2025-05-07T20:25:35.4473248Z 2025-05-07T20:25:35.4473252Z 2025-05-07T20:25:35.4476437Z 2025-05-07T20:25:35.4692557Z cuda-nvvp-12.6.80 | 109.3 MB | ###9 | 40%  2025-05-07T20:25:35.4692843Z 2025-05-07T20:25:35.4878578Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 100%  2025-05-07T20:25:35.4878882Z 2025-05-07T20:25:35.4878885Z 2025-05-07T20:25:35.4878889Z 2025-05-07T20:25:35.4878893Z 2025-05-07T20:25:35.4878896Z 2025-05-07T20:25:35.4880976Z 2025-05-07T20:25:35.5078407Z libcusolver-11.7.1.2 | 95.8 MB | ###1 | 31%  2025-05-07T20:25:35.5473852Z nsight-compute-2024. 
| 443.1 MB | ###### | 61% 2025-05-07T20:25:35.5474113Z 2025-05-07T20:25:35.5474118Z 2025-05-07T20:25:35.5474122Z 2025-05-07T20:25:35.5474125Z 2025-05-07T20:25:35.5476064Z 2025-05-07T20:25:35.5894918Z cuda-nvvp-12.6.80 | 109.3 MB | ####2 | 42%  2025-05-07T20:25:35.5895270Z 2025-05-07T20:25:35.5895273Z 2025-05-07T20:25:35.5895277Z 2025-05-07T20:25:35.5895281Z 2025-05-07T20:25:35.5895284Z 2025-05-07T20:25:35.5897786Z 2025-05-07T20:25:35.6081234Z libcusolver-11.7.1.2 | 95.8 MB | ###4 | 34%  2025-05-07T20:25:35.6477467Z nsight-compute-2024. | 443.1 MB | ######1 | 61% 2025-05-07T20:25:35.6477752Z 2025-05-07T20:25:35.6477758Z 2025-05-07T20:25:35.6477763Z 2025-05-07T20:25:35.6477768Z 2025-05-07T20:25:35.6479374Z 2025-05-07T20:25:35.6906767Z cuda-nvvp-12.6.80 | 109.3 MB | ####5 | 46%  2025-05-07T20:25:35.6907127Z 2025-05-07T20:25:35.6907133Z 2025-05-07T20:25:35.6907166Z 2025-05-07T20:25:35.6907172Z 2025-05-07T20:25:35.6907177Z 2025-05-07T20:25:35.6907182Z 2025-05-07T20:25:35.7084198Z libcusolver-11.7.1.2 | 95.8 MB | ###7 | 37%  2025-05-07T20:25:35.7478757Z nsight-compute-2024. | 443.1 MB | ######2 | 62% 2025-05-07T20:25:35.7479021Z 2025-05-07T20:25:35.7479025Z 2025-05-07T20:25:35.7479029Z 2025-05-07T20:25:35.7479032Z 2025-05-07T20:25:35.7480786Z 2025-05-07T20:25:35.7900282Z cuda-nvvp-12.6.80 | 109.3 MB | ####8 | 49%  2025-05-07T20:25:35.7900557Z 2025-05-07T20:25:35.7900561Z 2025-05-07T20:25:35.7900565Z 2025-05-07T20:25:35.7900569Z 2025-05-07T20:25:35.7900572Z 2025-05-07T20:25:35.7902879Z 2025-05-07T20:25:35.8087076Z libcusolver-11.7.1.2 | 95.8 MB | #### | 40%  2025-05-07T20:25:35.8480249Z nsight-compute-2024. | 443.1 MB | ######2 | 63% 2025-05-07T20:25:35.8480494Z 2025-05-07T20:25:35.8480498Z 2025-05-07T20:25:35.8480502Z 2025-05-07T20:25:35.8480517Z 2025-05-07T20:25:35.8483880Z 2025-05-07T20:25:35.8925889Z cuda-nvvp-12.6.80 | 109.3 MB | #####1 | 52%  2025-05-07T20:25:35.8926158Z 2025-05-07T20:25:35.8926162Z 2025-05-07T20:25:35.8926166Z 2025-05-07T20:25:35.8926170Z 2025-05-07T20:25:35.8926186Z 2025-05-07T20:25:35.8928144Z 2025-05-07T20:25:35.9089572Z libcusolver-11.7.1.2 | 95.8 MB | ####3 | 43%  2025-05-07T20:25:35.9533502Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:25:35.9533780Z 2025-05-07T20:25:35.9533785Z 2025-05-07T20:25:35.9533791Z 2025-05-07T20:25:35.9533796Z 2025-05-07T20:25:35.9536229Z 2025-05-07T20:25:35.9952414Z cuda-nvvp-12.6.80 | 109.3 MB | #####5 | 55%  2025-05-07T20:25:35.9952700Z 2025-05-07T20:25:35.9952704Z 2025-05-07T20:25:35.9952708Z 2025-05-07T20:25:35.9952719Z 2025-05-07T20:25:35.9952722Z 2025-05-07T20:25:35.9958699Z 2025-05-07T20:25:36.0153449Z libcusolver-11.7.1.2 | 95.8 MB | ####6 | 46%  2025-05-07T20:25:36.0153809Z 2025-05-07T20:25:36.0153824Z 2025-05-07T20:25:36.0201914Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:25:36.0564375Z nsight-compute-2024. 
| 443.1 MB | ######4 | 64% 2025-05-07T20:25:36.0564682Z 2025-05-07T20:25:36.0564935Z 2025-05-07T20:25:36.0564941Z 2025-05-07T20:25:36.0564945Z 2025-05-07T20:25:36.0564949Z 2025-05-07T20:25:36.0564952Z 2025-05-07T20:25:36.0566405Z 2025-05-07T20:25:36.0600445Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:25:36.0600828Z 2025-05-07T20:25:36.0600834Z 2025-05-07T20:25:36.0600839Z 2025-05-07T20:25:36.0600845Z 2025-05-07T20:25:36.0600850Z 2025-05-07T20:25:36.1104530Z cuda-nvvp-12.6.80 | 109.3 MB | #####8 | 58%  2025-05-07T20:25:36.1104910Z 2025-05-07T20:25:36.1104916Z 2025-05-07T20:25:36.1104921Z 2025-05-07T20:25:36.1104926Z 2025-05-07T20:25:36.1104931Z 2025-05-07T20:25:36.1106832Z 2025-05-07T20:25:36.1383567Z libcusolver-11.7.1.2 | 95.8 MB | ####8 | 49%  2025-05-07T20:25:36.1568633Z nsight-compute-2024. | 443.1 MB | ######4 | 65% 2025-05-07T20:25:36.1568895Z 2025-05-07T20:25:36.1568899Z 2025-05-07T20:25:36.1568903Z 2025-05-07T20:25:36.1568906Z 2025-05-07T20:25:36.1568922Z 2025-05-07T20:25:36.1568926Z 2025-05-07T20:25:36.1572756Z 2025-05-07T20:25:36.1716862Z libnpp-12.3.1.54 | 93.4 MB | 2 | 3%  2025-05-07T20:25:36.1717254Z 2025-05-07T20:25:36.1717258Z 2025-05-07T20:25:36.1717262Z 2025-05-07T20:25:36.1717266Z 2025-05-07T20:25:36.1717270Z 2025-05-07T20:25:36.2170655Z cuda-nvvp-12.6.80 | 109.3 MB | ###### | 61%  2025-05-07T20:25:36.2170999Z 2025-05-07T20:25:36.2171005Z 2025-05-07T20:25:36.2171010Z 2025-05-07T20:25:36.2171015Z 2025-05-07T20:25:36.2171021Z 2025-05-07T20:25:36.2174264Z 2025-05-07T20:25:36.2394755Z libcusolver-11.7.1.2 | 95.8 MB | #####1 | 52%  2025-05-07T20:25:36.2570847Z nsight-compute-2024. | 443.1 MB | ######5 | 65% 2025-05-07T20:25:36.2571237Z 2025-05-07T20:25:36.2571243Z 2025-05-07T20:25:36.2571248Z 2025-05-07T20:25:36.2571253Z 2025-05-07T20:25:36.2571259Z 2025-05-07T20:25:36.2571264Z 2025-05-07T20:25:36.2573601Z 2025-05-07T20:25:36.2795642Z libnpp-12.3.1.54 | 93.4 MB | 5 | 5%  2025-05-07T20:25:36.2795928Z 2025-05-07T20:25:36.2795932Z 2025-05-07T20:25:36.2795936Z 2025-05-07T20:25:36.2795940Z 2025-05-07T20:25:36.2795943Z 2025-05-07T20:25:36.3216007Z cuda-nvvp-12.6.80 | 109.3 MB | ######3 | 64%  2025-05-07T20:25:36.3216288Z 2025-05-07T20:25:36.3216292Z 2025-05-07T20:25:36.3216296Z 2025-05-07T20:25:36.3216300Z 2025-05-07T20:25:36.3216303Z 2025-05-07T20:25:36.3216318Z 2025-05-07T20:25:36.3425712Z libcusolver-11.7.1.2 | 95.8 MB | #####4 | 54%  2025-05-07T20:25:36.3571525Z nsight-compute-2024. | 443.1 MB | ######5 | 66% 2025-05-07T20:25:36.3571777Z 2025-05-07T20:25:36.3571795Z 2025-05-07T20:25:36.3571799Z 2025-05-07T20:25:36.3571803Z 2025-05-07T20:25:36.3571807Z 2025-05-07T20:25:36.3571810Z 2025-05-07T20:25:36.3573731Z 2025-05-07T20:25:36.3942835Z libnpp-12.3.1.54 | 93.4 MB | 7 | 8%  2025-05-07T20:25:36.3943159Z 2025-05-07T20:25:36.3943165Z 2025-05-07T20:25:36.3943170Z 2025-05-07T20:25:36.3943175Z 2025-05-07T20:25:36.3943180Z 2025-05-07T20:25:36.4290289Z cuda-nvvp-12.6.80 | 109.3 MB | ######6 | 67%  2025-05-07T20:25:36.4290686Z 2025-05-07T20:25:36.4290692Z 2025-05-07T20:25:36.4290698Z 2025-05-07T20:25:36.4290704Z 2025-05-07T20:25:36.4290709Z 2025-05-07T20:25:36.4292327Z 2025-05-07T20:25:36.4442106Z libcusolver-11.7.1.2 | 95.8 MB | #####7 | 57%  2025-05-07T20:25:36.4580170Z nsight-compute-2024. 
| 443.1 MB | ######6 | 66% 2025-05-07T20:25:36.4580443Z 2025-05-07T20:25:36.4580447Z 2025-05-07T20:25:36.4580451Z 2025-05-07T20:25:36.4580455Z 2025-05-07T20:25:36.4580459Z 2025-05-07T20:25:36.4580700Z 2025-05-07T20:25:36.4583967Z 2025-05-07T20:25:36.5079616Z libnpp-12.3.1.54 | 93.4 MB | # | 11%  2025-05-07T20:25:36.5080003Z 2025-05-07T20:25:36.5080007Z 2025-05-07T20:25:36.5080011Z 2025-05-07T20:25:36.5080234Z 2025-05-07T20:25:36.5080988Z 2025-05-07T20:25:36.5293058Z cuda-nvvp-12.6.80 | 109.3 MB | ######9 | 69%  2025-05-07T20:25:36.5293427Z 2025-05-07T20:25:36.5293431Z 2025-05-07T20:25:36.5293435Z 2025-05-07T20:25:36.5293438Z 2025-05-07T20:25:36.5293442Z 2025-05-07T20:25:36.5297129Z 2025-05-07T20:25:36.5557930Z libcusolver-11.7.1.2 | 95.8 MB | #####9 | 60%  2025-05-07T20:25:36.5586468Z nsight-compute-2024. | 443.1 MB | ######7 | 67% 2025-05-07T20:25:36.5586749Z 2025-05-07T20:25:36.5586753Z 2025-05-07T20:25:36.5586757Z 2025-05-07T20:25:36.5586761Z 2025-05-07T20:25:36.5586764Z 2025-05-07T20:25:36.5586768Z 2025-05-07T20:25:36.5588470Z 2025-05-07T20:25:36.6259973Z libnpp-12.3.1.54 | 93.4 MB | #3 | 14%  2025-05-07T20:25:36.6260277Z 2025-05-07T20:25:36.6260281Z 2025-05-07T20:25:36.6260284Z 2025-05-07T20:25:36.6260288Z 2025-05-07T20:25:36.6260292Z 2025-05-07T20:25:36.6294227Z cuda-nvvp-12.6.80 | 109.3 MB | #######1 | 72%  2025-05-07T20:25:36.6294536Z 2025-05-07T20:25:36.6294540Z 2025-05-07T20:25:36.6294543Z 2025-05-07T20:25:36.6294547Z 2025-05-07T20:25:36.6294551Z 2025-05-07T20:25:36.6294555Z 2025-05-07T20:25:36.6588955Z libcusolver-11.7.1.2 | 95.8 MB | ######2 | 63%  2025-05-07T20:25:36.6589389Z 2025-05-07T20:25:36.6589395Z 2025-05-07T20:25:36.6589400Z 2025-05-07T20:25:36.6589405Z 2025-05-07T20:25:36.6589410Z 2025-05-07T20:25:36.6589416Z 2025-05-07T20:25:36.6591786Z 2025-05-07T20:25:36.6699482Z libnpp-12.3.1.54 | 93.4 MB | #6 | 16%  2025-05-07T20:25:36.7296252Z nsight-compute-2024. | 443.1 MB | ######7 | 68% 2025-05-07T20:25:36.7296592Z 2025-05-07T20:25:36.7296624Z 2025-05-07T20:25:36.7296630Z 2025-05-07T20:25:36.7296635Z 2025-05-07T20:25:36.7296639Z 2025-05-07T20:25:36.7300554Z 2025-05-07T20:25:36.7335089Z libcusolver-11.7.1.2 | 95.8 MB | ######5 | 66%  2025-05-07T20:25:36.7335549Z 2025-05-07T20:25:36.7335571Z 2025-05-07T20:25:36.7335577Z 2025-05-07T20:25:36.7335582Z 2025-05-07T20:25:36.7335588Z 2025-05-07T20:25:36.7591389Z cuda-nvvp-12.6.80 | 109.3 MB | #######4 | 74%  2025-05-07T20:25:36.7591678Z 2025-05-07T20:25:36.7591682Z 2025-05-07T20:25:36.7591686Z 2025-05-07T20:25:36.7591689Z 2025-05-07T20:25:36.7591693Z 2025-05-07T20:25:36.7591697Z 2025-05-07T20:25:36.7593134Z 2025-05-07T20:25:36.7700063Z libnpp-12.3.1.54 | 93.4 MB | #8 | 19%  2025-05-07T20:25:36.8333545Z nsight-compute-2024. | 443.1 MB | ######8 | 68% 2025-05-07T20:25:36.8333825Z 2025-05-07T20:25:36.8333829Z 2025-05-07T20:25:36.8333833Z 2025-05-07T20:25:36.8333837Z 2025-05-07T20:25:36.8333870Z 2025-05-07T20:25:36.8333874Z 2025-05-07T20:25:36.8405108Z libcusolver-11.7.1.2 | 95.8 MB | ######8 | 68%  2025-05-07T20:25:36.8405821Z 2025-05-07T20:25:36.8405825Z 2025-05-07T20:25:36.8405829Z 2025-05-07T20:25:36.8405833Z 2025-05-07T20:25:36.8405855Z 2025-05-07T20:25:36.8622347Z cuda-nvvp-12.6.80 | 109.3 MB | #######6 | 77%  2025-05-07T20:25:36.8622633Z 2025-05-07T20:25:36.8622637Z 2025-05-07T20:25:36.8622640Z 2025-05-07T20:25:36.8622644Z 2025-05-07T20:25:36.8622648Z 2025-05-07T20:25:36.8622652Z 2025-05-07T20:25:36.8622655Z 2025-05-07T20:25:36.8720070Z libnpp-12.3.1.54 | 93.4 MB | ##1 | 22%  2025-05-07T20:25:36.9354832Z nsight-compute-2024. 
| 443.1 MB | ######8 | 69% 2025-05-07T20:25:36.9355098Z 2025-05-07T20:25:36.9355103Z 2025-05-07T20:25:36.9355107Z 2025-05-07T20:25:36.9355110Z 2025-05-07T20:25:36.9355114Z 2025-05-07T20:25:36.9356819Z 2025-05-07T20:25:36.9493163Z libcusolver-11.7.1.2 | 95.8 MB | #######1 | 71%  2025-05-07T20:25:36.9493471Z 2025-05-07T20:25:36.9493475Z 2025-05-07T20:25:36.9493478Z 2025-05-07T20:25:36.9493482Z 2025-05-07T20:25:36.9495415Z 2025-05-07T20:25:36.9639813Z cuda-nvvp-12.6.80 | 109.3 MB | #######9 | 79%  2025-05-07T20:25:36.9640419Z 2025-05-07T20:25:36.9640423Z 2025-05-07T20:25:36.9640427Z 2025-05-07T20:25:36.9640431Z 2025-05-07T20:25:36.9640435Z 2025-05-07T20:25:36.9640446Z 2025-05-07T20:25:36.9643492Z 2025-05-07T20:25:36.9723995Z libnpp-12.3.1.54 | 93.4 MB | ##4 | 24%  2025-05-07T20:25:37.0360100Z nsight-compute-2024. | 443.1 MB | ######9 | 69% 2025-05-07T20:25:37.0360653Z 2025-05-07T20:25:37.0360658Z 2025-05-07T20:25:37.0360661Z 2025-05-07T20:25:37.0360666Z 2025-05-07T20:25:37.0360669Z 2025-05-07T20:25:37.0361620Z 2025-05-07T20:25:37.0505393Z libcusolver-11.7.1.2 | 95.8 MB | #######3 | 74%  2025-05-07T20:25:37.0505805Z 2025-05-07T20:25:37.0505834Z 2025-05-07T20:25:37.0505840Z 2025-05-07T20:25:37.0505845Z 2025-05-07T20:25:37.0507795Z 2025-05-07T20:25:37.0641425Z cuda-nvvp-12.6.80 | 109.3 MB | ########1 | 82%  2025-05-07T20:25:37.0641735Z 2025-05-07T20:25:37.0641741Z 2025-05-07T20:25:37.0641764Z 2025-05-07T20:25:37.0641769Z 2025-05-07T20:25:37.0641774Z 2025-05-07T20:25:37.0641787Z 2025-05-07T20:25:37.0641794Z 2025-05-07T20:25:37.0808818Z libnpp-12.3.1.54 | 93.4 MB | ##7 | 27%  2025-05-07T20:25:37.1361897Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:25:37.1362173Z 2025-05-07T20:25:37.1362177Z 2025-05-07T20:25:37.1362180Z 2025-05-07T20:25:37.1362185Z 2025-05-07T20:25:37.1362188Z 2025-05-07T20:25:37.1363820Z 2025-05-07T20:25:37.1582260Z libcusolver-11.7.1.2 | 95.8 MB | #######6 | 77%  2025-05-07T20:25:37.1582615Z 2025-05-07T20:25:37.1582619Z 2025-05-07T20:25:37.1582623Z 2025-05-07T20:25:37.1582627Z 2025-05-07T20:25:37.1587306Z 2025-05-07T20:25:37.1642477Z cuda-nvvp-12.6.80 | 109.3 MB | ########3 | 84%  2025-05-07T20:25:37.1642872Z 2025-05-07T20:25:37.1642878Z 2025-05-07T20:25:37.1642883Z 2025-05-07T20:25:37.1642888Z 2025-05-07T20:25:37.1642893Z 2025-05-07T20:25:37.1642923Z 2025-05-07T20:25:37.1645922Z 2025-05-07T20:25:37.1835061Z libnpp-12.3.1.54 | 93.4 MB | ##9 | 30%  2025-05-07T20:25:37.2363028Z nsight-compute-2024. | 443.1 MB | ####### | 71% 2025-05-07T20:25:37.2363301Z 2025-05-07T20:25:37.2363305Z 2025-05-07T20:25:37.2363309Z 2025-05-07T20:25:37.2363313Z 2025-05-07T20:25:37.2363317Z 2025-05-07T20:25:37.2365278Z 2025-05-07T20:25:37.2645593Z libcusolver-11.7.1.2 | 95.8 MB | #######9 | 79%  2025-05-07T20:25:37.2645894Z 2025-05-07T20:25:37.2645898Z 2025-05-07T20:25:37.2645902Z 2025-05-07T20:25:37.2645906Z 2025-05-07T20:25:37.2645909Z 2025-05-07T20:25:37.2706220Z cuda-nvvp-12.6.80 | 109.3 MB | ########6 | 86%  2025-05-07T20:25:37.2706527Z 2025-05-07T20:25:37.2706531Z 2025-05-07T20:25:37.2706535Z 2025-05-07T20:25:37.2706539Z 2025-05-07T20:25:37.2706543Z 2025-05-07T20:25:37.2706546Z 2025-05-07T20:25:37.2706558Z 2025-05-07T20:25:37.2837060Z libnpp-12.3.1.54 | 93.4 MB | ###2 | 33%  2025-05-07T20:25:37.3363236Z nsight-compute-2024. 
| 443.1 MB | #######1 | 71% 2025-05-07T20:25:37.3363594Z 2025-05-07T20:25:37.3363599Z 2025-05-07T20:25:37.3363604Z 2025-05-07T20:25:37.3363609Z 2025-05-07T20:25:37.3363614Z 2025-05-07T20:25:37.3365467Z 2025-05-07T20:25:37.3659851Z libcusolver-11.7.1.2 | 95.8 MB | ########2 | 82%  2025-05-07T20:25:37.3660156Z 2025-05-07T20:25:37.3660160Z 2025-05-07T20:25:37.3660164Z 2025-05-07T20:25:37.3660167Z 2025-05-07T20:25:37.3661439Z 2025-05-07T20:25:37.3837200Z cuda-nvvp-12.6.80 | 109.3 MB | ########8 | 89%  2025-05-07T20:25:37.3837529Z 2025-05-07T20:25:37.3837534Z 2025-05-07T20:25:37.3837537Z 2025-05-07T20:25:37.3837830Z 2025-05-07T20:25:37.3837838Z 2025-05-07T20:25:37.3837844Z 2025-05-07T20:25:37.3839241Z 2025-05-07T20:25:37.3852993Z libnpp-12.3.1.54 | 93.4 MB | ###5 | 35%  2025-05-07T20:25:37.4364274Z nsight-compute-2024. | 443.1 MB | #######1 | 72% 2025-05-07T20:25:37.4364774Z 2025-05-07T20:25:37.4364778Z 2025-05-07T20:25:37.4364781Z 2025-05-07T20:25:37.4364785Z 2025-05-07T20:25:37.4364789Z 2025-05-07T20:25:37.4366212Z 2025-05-07T20:25:37.4760459Z libcusolver-11.7.1.2 | 95.8 MB | ########5 | 86%  2025-05-07T20:25:37.4760763Z 2025-05-07T20:25:37.4760767Z 2025-05-07T20:25:37.4760771Z 2025-05-07T20:25:37.4760774Z 2025-05-07T20:25:37.4762394Z 2025-05-07T20:25:37.4844008Z cuda-nvvp-12.6.80 | 109.3 MB | ######### | 91%  2025-05-07T20:25:37.4844293Z 2025-05-07T20:25:37.4844297Z 2025-05-07T20:25:37.4844301Z 2025-05-07T20:25:37.4844305Z 2025-05-07T20:25:37.4844309Z 2025-05-07T20:25:37.4844313Z 2025-05-07T20:25:37.4844333Z 2025-05-07T20:25:37.4861121Z libnpp-12.3.1.54 | 93.4 MB | ###8 | 38%  2025-05-07T20:25:37.5428319Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:25:37.5428688Z 2025-05-07T20:25:37.5428694Z 2025-05-07T20:25:37.5428725Z 2025-05-07T20:25:37.5428730Z 2025-05-07T20:25:37.5428735Z 2025-05-07T20:25:37.5430596Z 2025-05-07T20:25:37.5802676Z libcusolver-11.7.1.2 | 95.8 MB | ########8 | 89%  2025-05-07T20:25:37.5802972Z 2025-05-07T20:25:37.5802976Z 2025-05-07T20:25:37.5802980Z 2025-05-07T20:25:37.5802984Z 2025-05-07T20:25:37.5803035Z 2025-05-07T20:25:37.5864999Z cuda-nvvp-12.6.80 | 109.3 MB | #########3 | 93%  2025-05-07T20:25:37.5959891Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:25:37.5960149Z 2025-05-07T20:25:37.5960153Z 2025-05-07T20:25:37.5960157Z 2025-05-07T20:25:37.5960161Z 2025-05-07T20:25:37.5960164Z 2025-05-07T20:25:37.5960168Z 2025-05-07T20:25:37.5963090Z 2025-05-07T20:25:37.6429985Z libnpp-12.3.1.54 | 93.4 MB | #### | 41%  2025-05-07T20:25:37.6430318Z 2025-05-07T20:25:37.6430324Z 2025-05-07T20:25:37.6430330Z 2025-05-07T20:25:37.6430335Z 2025-05-07T20:25:37.6430340Z 2025-05-07T20:25:37.6430371Z 2025-05-07T20:25:37.6813490Z libcusolver-11.7.1.2 | 95.8 MB | #########1 | 92%  2025-05-07T20:25:37.6813799Z 2025-05-07T20:25:37.6813803Z 2025-05-07T20:25:37.6813806Z 2025-05-07T20:25:37.6813810Z 2025-05-07T20:25:37.6813813Z 2025-05-07T20:25:37.6904740Z cuda-nvvp-12.6.80 | 109.3 MB | #########5 | 95%  2025-05-07T20:25:37.6960766Z nsight-compute-2024. 
| 443.1 MB | #######3 | 74% 2025-05-07T20:25:37.6961030Z 2025-05-07T20:25:37.6961037Z 2025-05-07T20:25:37.6961041Z 2025-05-07T20:25:37.6961044Z 2025-05-07T20:25:37.6961048Z 2025-05-07T20:25:37.6961052Z 2025-05-07T20:25:37.6964183Z 2025-05-07T20:25:37.7466853Z libnpp-12.3.1.54 | 93.4 MB | ####3 | 44%  2025-05-07T20:25:37.7467178Z 2025-05-07T20:25:37.7467183Z 2025-05-07T20:25:37.7467187Z 2025-05-07T20:25:37.7467190Z 2025-05-07T20:25:37.7467194Z 2025-05-07T20:25:37.7468097Z 2025-05-07T20:25:37.7814393Z libcusolver-11.7.1.2 | 95.8 MB | #########4 | 95%  2025-05-07T20:25:37.7814762Z 2025-05-07T20:25:37.7814768Z 2025-05-07T20:25:37.7814773Z 2025-05-07T20:25:37.7814779Z 2025-05-07T20:25:37.7814784Z 2025-05-07T20:25:37.7931821Z cuda-nvvp-12.6.80 | 109.3 MB | #########7 | 98%  2025-05-07T20:25:37.8086239Z nsight-compute-2024. | 443.1 MB | #######4 | 74% 2025-05-07T20:25:37.8086500Z 2025-05-07T20:25:37.8086504Z 2025-05-07T20:25:37.8086508Z 2025-05-07T20:25:37.8088572Z 2025-05-07T20:25:37.8098866Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:37.8099168Z 2025-05-07T20:25:37.8099172Z 2025-05-07T20:25:37.8099176Z 2025-05-07T20:25:37.8099180Z 2025-05-07T20:25:37.8099184Z 2025-05-07T20:25:37.8099187Z 2025-05-07T20:25:37.8110717Z 2025-05-07T20:25:37.8616469Z libnpp-12.3.1.54 | 93.4 MB | ####6 | 46%  2025-05-07T20:25:37.8616768Z 2025-05-07T20:25:37.8616774Z 2025-05-07T20:25:37.8616778Z 2025-05-07T20:25:37.8616782Z 2025-05-07T20:25:37.8616785Z 2025-05-07T20:25:37.8619075Z 2025-05-07T20:25:37.8955840Z libcusolver-11.7.1.2 | 95.8 MB | #########7 | 98%  2025-05-07T20:25:37.9101748Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:25:37.9102008Z 2025-05-07T20:25:37.9102260Z 2025-05-07T20:25:37.9102265Z 2025-05-07T20:25:37.9102269Z 2025-05-07T20:25:37.9102272Z 2025-05-07T20:25:37.9102276Z 2025-05-07T20:25:37.9104875Z 2025-05-07T20:25:37.9957864Z libnpp-12.3.1.54 | 93.4 MB | ####9 | 49%  2025-05-07T20:25:38.0105098Z nsight-compute-2024. | 443.1 MB | #######5 | 76% 2025-05-07T20:25:38.0105360Z 2025-05-07T20:25:38.0105364Z 2025-05-07T20:25:38.0105367Z 2025-05-07T20:25:38.0105371Z 2025-05-07T20:25:38.0105399Z 2025-05-07T20:25:38.0105403Z 2025-05-07T20:25:38.0107172Z 2025-05-07T20:25:38.0958869Z libnpp-12.3.1.54 | 93.4 MB | #####2 | 53%  2025-05-07T20:25:38.1108501Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:25:38.1108877Z 2025-05-07T20:25:38.1108881Z 2025-05-07T20:25:38.1108885Z 2025-05-07T20:25:38.1108888Z 2025-05-07T20:25:38.1108892Z 2025-05-07T20:25:38.1108896Z 2025-05-07T20:25:38.1108900Z 2025-05-07T20:25:38.1961326Z libnpp-12.3.1.54 | 93.4 MB | #####6 | 56%  2025-05-07T20:25:38.2110186Z nsight-compute-2024. | 443.1 MB | #######7 | 77% 2025-05-07T20:25:38.2110561Z 2025-05-07T20:25:38.2110567Z 2025-05-07T20:25:38.2110573Z 2025-05-07T20:25:38.2110578Z 2025-05-07T20:25:38.2110583Z 2025-05-07T20:25:38.2110589Z 2025-05-07T20:25:38.2110595Z 2025-05-07T20:25:38.2965792Z libnpp-12.3.1.54 | 93.4 MB | ###### | 60%  2025-05-07T20:25:38.3126339Z nsight-compute-2024. | 443.1 MB | #######7 | 78% 2025-05-07T20:25:38.3126709Z 2025-05-07T20:25:38.3126715Z 2025-05-07T20:25:38.3126720Z 2025-05-07T20:25:38.3126726Z 2025-05-07T20:25:38.3126731Z 2025-05-07T20:25:38.3126737Z 2025-05-07T20:25:38.3126742Z 2025-05-07T20:25:38.3969893Z libnpp-12.3.1.54 | 93.4 MB | ######4 | 64%  2025-05-07T20:25:38.4127646Z nsight-compute-2024. 
| 443.1 MB | #######8 | 79% 2025-05-07T20:25:38.4128053Z 2025-05-07T20:25:38.4128059Z 2025-05-07T20:25:38.4128064Z 2025-05-07T20:25:38.4128069Z 2025-05-07T20:25:38.4128075Z 2025-05-07T20:25:38.4128080Z 2025-05-07T20:25:38.4128085Z 2025-05-07T20:25:38.4971317Z libnpp-12.3.1.54 | 93.4 MB | ######8 | 68%  2025-05-07T20:25:38.5128408Z nsight-compute-2024. | 443.1 MB | #######9 | 79% 2025-05-07T20:25:38.5128793Z 2025-05-07T20:25:38.5128799Z 2025-05-07T20:25:38.5128804Z 2025-05-07T20:25:38.5128810Z 2025-05-07T20:25:38.5128815Z 2025-05-07T20:25:38.5128821Z 2025-05-07T20:25:38.5128858Z 2025-05-07T20:25:38.5973182Z libnpp-12.3.1.54 | 93.4 MB | #######2 | 72%  2025-05-07T20:25:38.6133224Z nsight-compute-2024. | 443.1 MB | ######## | 80% 2025-05-07T20:25:38.6133499Z 2025-05-07T20:25:38.6133503Z 2025-05-07T20:25:38.6133535Z 2025-05-07T20:25:38.6133539Z 2025-05-07T20:25:38.6133542Z 2025-05-07T20:25:38.6133546Z 2025-05-07T20:25:38.6133550Z 2025-05-07T20:25:38.7016373Z libnpp-12.3.1.54 | 93.4 MB | #######6 | 76%  2025-05-07T20:25:38.7136758Z nsight-compute-2024. | 443.1 MB | ######## | 81% 2025-05-07T20:25:38.7137055Z 2025-05-07T20:25:38.7137061Z 2025-05-07T20:25:38.7137066Z 2025-05-07T20:25:38.7137071Z 2025-05-07T20:25:38.7137077Z 2025-05-07T20:25:38.7137082Z 2025-05-07T20:25:38.7137087Z 2025-05-07T20:25:38.8116844Z libnpp-12.3.1.54 | 93.4 MB | ######## | 80%  2025-05-07T20:25:38.8197322Z nsight-compute-2024. | 443.1 MB | ########1 | 81% 2025-05-07T20:25:38.8197642Z 2025-05-07T20:25:38.8197940Z 2025-05-07T20:25:38.8197949Z 2025-05-07T20:25:38.8197964Z 2025-05-07T20:25:38.8197969Z 2025-05-07T20:25:38.8197974Z 2025-05-07T20:25:38.8200808Z 2025-05-07T20:25:38.9119296Z libnpp-12.3.1.54 | 93.4 MB | ########4 | 84%  2025-05-07T20:25:38.9216649Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:25:38.9216996Z 2025-05-07T20:25:38.9217011Z 2025-05-07T20:25:38.9217016Z 2025-05-07T20:25:38.9217022Z 2025-05-07T20:25:38.9217027Z 2025-05-07T20:25:38.9217032Z 2025-05-07T20:25:38.9222979Z 2025-05-07T20:25:39.0212012Z libnpp-12.3.1.54 | 93.4 MB | ########7 | 88%  2025-05-07T20:25:39.0219094Z nsight-compute-2024. | 443.1 MB | ########2 | 83% 2025-05-07T20:25:39.0219434Z 2025-05-07T20:25:39.0219440Z 2025-05-07T20:25:39.0219444Z 2025-05-07T20:25:39.0219449Z 2025-05-07T20:25:39.0219468Z 2025-05-07T20:25:39.0219473Z 2025-05-07T20:25:39.0220745Z 2025-05-07T20:25:39.1213776Z libnpp-12.3.1.54 | 93.4 MB | #########1 | 92%  2025-05-07T20:25:39.1268126Z nsight-compute-2024. | 443.1 MB | ########3 | 84% 2025-05-07T20:25:39.1268484Z 2025-05-07T20:25:39.1268490Z 2025-05-07T20:25:39.1268496Z 2025-05-07T20:25:39.1268501Z 2025-05-07T20:25:39.1268526Z 2025-05-07T20:25:39.1268531Z 2025-05-07T20:25:39.1268797Z 2025-05-07T20:25:39.2274406Z libnpp-12.3.1.54 | 93.4 MB | #########5 | 96%  2025-05-07T20:25:39.2274718Z 2025-05-07T20:25:39.2274724Z 2025-05-07T20:25:39.2274730Z 2025-05-07T20:25:39.2274735Z 2025-05-07T20:25:39.2274741Z 2025-05-07T20:25:39.2274746Z 2025-05-07T20:25:39.2274760Z 2025-05-07T20:25:39.2307382Z libnpp-12.3.1.54 | 93.4 MB | #########9 | 100%  2025-05-07T20:25:39.3313575Z nsight-compute-2024. | 443.1 MB | ########4 | 84% 2025-05-07T20:25:39.4315073Z nsight-compute-2024. | 443.1 MB | ########5 | 85% 2025-05-07T20:25:39.5316330Z nsight-compute-2024. | 443.1 MB | ########6 | 86% 2025-05-07T20:25:39.6316884Z nsight-compute-2024. | 443.1 MB | ########6 | 87% 2025-05-07T20:25:39.7361964Z nsight-compute-2024. | 443.1 MB | ########7 | 88% 2025-05-07T20:25:39.8365207Z nsight-compute-2024. 
| 443.1 MB | ########8 | 89% 2025-05-07T20:25:39.9371061Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:25:40.0373760Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:25:40.1375289Z nsight-compute-2024. | 443.1 MB | #########1 | 92% 2025-05-07T20:25:40.3631283Z nsight-compute-2024. | 443.1 MB | #########2 | 93% 2025-05-07T20:25:40.4632121Z nsight-compute-2024. | 443.1 MB | #########3 | 94% 2025-05-07T20:25:40.5635333Z nsight-compute-2024. | 443.1 MB | #########4 | 94% 2025-05-07T20:25:40.6417460Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:25:40.6417742Z 2025-05-07T20:25:40.6417746Z 2025-05-07T20:25:40.6417749Z 2025-05-07T20:25:40.6417753Z 2025-05-07T20:25:40.6417757Z 2025-05-07T20:25:40.6419772Z 2025-05-07T20:25:40.6635817Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:25:40.6931188Z nsight-compute-2024. | 443.1 MB | #########6 | 96% 2025-05-07T20:25:40.6931546Z 2025-05-07T20:25:40.6931552Z 2025-05-07T20:25:40.6931557Z 2025-05-07T20:25:40.6931562Z 2025-05-07T20:25:40.6931587Z 2025-05-07T20:25:40.6931592Z 2025-05-07T20:25:40.6931598Z 2025-05-07T20:25:40.6931603Z 2025-05-07T20:25:40.7801467Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:25:40.7934514Z nsight-compute-2024. | 443.1 MB | #########7 | 97% 2025-05-07T20:25:40.7934881Z 2025-05-07T20:25:40.7934887Z 2025-05-07T20:25:40.7934892Z 2025-05-07T20:25:40.7934897Z 2025-05-07T20:25:40.7934902Z 2025-05-07T20:25:40.7934908Z 2025-05-07T20:25:40.7934913Z 2025-05-07T20:25:40.7936515Z 2025-05-07T20:25:40.8955701Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 7%  2025-05-07T20:25:40.9087599Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:25:40.9088115Z 2025-05-07T20:25:40.9088121Z 2025-05-07T20:25:40.9088124Z 2025-05-07T20:25:40.9088128Z 2025-05-07T20:25:40.9088132Z 2025-05-07T20:25:40.9088136Z 2025-05-07T20:25:40.9088140Z 2025-05-07T20:25:40.9089743Z 2025-05-07T20:25:41.0111809Z cuda-nvdisasm-12.6.7 | 47.6 MB | #3 | 14%  2025-05-07T20:25:41.0248847Z nsight-compute-2024. | 443.1 MB | #########8 | 99% 2025-05-07T20:25:41.0249109Z 2025-05-07T20:25:41.0249113Z 2025-05-07T20:25:41.0249117Z 2025-05-07T20:25:41.0249121Z 2025-05-07T20:25:41.0249125Z 2025-05-07T20:25:41.0249129Z 2025-05-07T20:25:41.0249133Z 2025-05-07T20:25:41.0251963Z 2025-05-07T20:25:41.0955353Z cuda-nvdisasm-12.6.7 | 47.6 MB | ## | 20%  2025-05-07T20:25:41.0955663Z 2025-05-07T20:25:41.0955667Z 2025-05-07T20:25:41.0955671Z 2025-05-07T20:25:41.0955674Z 2025-05-07T20:25:41.0955678Z 2025-05-07T20:25:41.1164757Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:25:41.1252367Z nsight-compute-2024. 
| 443.1 MB | ########## | 100%
2025-05-07T20:25:42.1222852Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:25:42.9254209Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:25:43.8027975Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:25:43.9567798Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:25:44.0032229Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:25:44.2321218Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
2025-05-07T20:25:44.6881721Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:25:45.2821333Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:25:45.2883599Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:25:45.5323495Z python-3.9.18 | 22.7 MB | ########## | 100%
2025-05-07T20:25:45.6323738Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:25:46.0212127Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:25:46.0801739Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:25:46.1801855Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:25:46.3426037Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:25:48.3027396Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:25:48.4399184Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:25:49.2906721Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:25:52.5289226Z ... (more hidden) ...
2025-05-07T20:25:57.1766893Z 2025-05-07T20:25:57.1766897Z 2025-05-07T20:25:57.1766901Z 2025-05-07T20:25:57.1766904Z 2025-05-07T20:25:57.1766908Z 2025-05-07T20:25:57.1767097Z  2025-05-07T20:25:57.1767335Z 2025-05-07T20:25:57.1767338Z 2025-05-07T20:25:57.1767342Z 2025-05-07T20:25:57.1767346Z 2025-05-07T20:25:57.1767349Z 2025-05-07T20:25:57.1767353Z 2025-05-07T20:25:57.1767357Z 2025-05-07T20:25:57.1767367Z 2025-05-07T20:25:57.1767371Z 2025-05-07T20:25:57.1767374Z 2025-05-07T20:25:57.1767378Z 2025-05-07T20:25:57.1767381Z 2025-05-07T20:25:57.1767385Z 2025-05-07T20:25:57.1767448Z 2025-05-07T20:25:57.1767451Z 2025-05-07T20:25:57.1767455Z 2025-05-07T20:25:57.1767459Z 2025-05-07T20:25:57.1767617Z  2025-05-07T20:25:57.1767823Z 2025-05-07T20:25:57.1767826Z 2025-05-07T20:25:57.1767833Z 2025-05-07T20:25:57.1767838Z 2025-05-07T20:25:57.1767844Z 2025-05-07T20:25:57.1767855Z 2025-05-07T20:25:57.1767861Z 2025-05-07T20:25:57.1767866Z 2025-05-07T20:25:57.1767871Z 2025-05-07T20:25:57.1767876Z 2025-05-07T20:25:57.1767881Z 2025-05-07T20:25:57.1767886Z 2025-05-07T20:25:57.1767891Z 2025-05-07T20:25:57.1767896Z 2025-05-07T20:25:57.1767901Z 2025-05-07T20:25:57.1767912Z 2025-05-07T20:25:57.1767917Z 2025-05-07T20:25:57.1767922Z 2025-05-07T20:25:57.1768134Z  2025-05-07T20:25:57.1768343Z 2025-05-07T20:25:57.1768347Z 2025-05-07T20:25:57.1768450Z  2025-05-07T20:25:57.1768591Z 2025-05-07T20:25:57.1768597Z 2025-05-07T20:25:57.1768738Z  2025-05-07T20:25:57.1768886Z 2025-05-07T20:25:57.1768891Z 2025-05-07T20:25:57.1768897Z 2025-05-07T20:25:57.1769042Z  2025-05-07T20:25:57.1769186Z 2025-05-07T20:25:57.1769192Z 2025-05-07T20:25:57.1769205Z 2025-05-07T20:25:57.1769210Z 2025-05-07T20:25:57.1769351Z  2025-05-07T20:25:57.1769510Z 2025-05-07T20:25:57.1769516Z 2025-05-07T20:25:57.1769521Z 2025-05-07T20:25:57.1769635Z 2025-05-07T20:25:57.1769641Z 2025-05-07T20:25:57.1769802Z  2025-05-07T20:25:57.1769983Z 2025-05-07T20:25:57.1769990Z 2025-05-07T20:25:57.1769996Z 2025-05-07T20:25:57.1770003Z 2025-05-07T20:25:57.1770010Z 2025-05-07T20:25:57.1770106Z 2025-05-07T20:25:57.1770276Z  2025-05-07T20:25:57.1770403Z 2025-05-07T20:25:57.1770406Z 2025-05-07T20:25:57.1770410Z 2025-05-07T20:25:57.1770414Z 2025-05-07T20:25:57.1770417Z 2025-05-07T20:25:57.1770421Z 2025-05-07T20:25:57.1770425Z 2025-05-07T20:25:57.1770548Z  2025-05-07T20:25:57.1770684Z 2025-05-07T20:25:57.1770687Z 2025-05-07T20:25:57.1770691Z 2025-05-07T20:25:57.1770695Z 2025-05-07T20:25:57.1770698Z 2025-05-07T20:25:57.1770702Z 2025-05-07T20:25:57.1770706Z 2025-05-07T20:25:57.1770709Z 2025-05-07T20:25:57.1770833Z  2025-05-07T20:25:57.1770981Z 2025-05-07T20:25:57.1770984Z 2025-05-07T20:25:57.1770988Z 2025-05-07T20:25:57.1770992Z 2025-05-07T20:25:57.1771001Z 2025-05-07T20:25:57.1771005Z 2025-05-07T20:25:57.1771009Z 2025-05-07T20:25:57.1771012Z 2025-05-07T20:25:57.1771016Z 2025-05-07T20:25:57.1771139Z  2025-05-07T20:25:57.1771291Z 2025-05-07T20:25:57.1771295Z 2025-05-07T20:25:57.1771298Z 2025-05-07T20:25:57.1771307Z 2025-05-07T20:25:57.1771311Z 2025-05-07T20:25:57.1771315Z 2025-05-07T20:25:57.1771318Z 2025-05-07T20:25:57.1771322Z 2025-05-07T20:25:57.1771325Z 2025-05-07T20:25:57.1771336Z 2025-05-07T20:25:57.1771459Z  2025-05-07T20:25:57.1771621Z 2025-05-07T20:25:57.1771624Z 2025-05-07T20:25:57.1771628Z 2025-05-07T20:25:57.1771631Z 2025-05-07T20:25:57.1771635Z 2025-05-07T20:25:57.1771639Z 2025-05-07T20:25:57.1771642Z 2025-05-07T20:25:57.1771653Z 2025-05-07T20:25:57.1771657Z 2025-05-07T20:25:57.1771660Z 2025-05-07T20:25:57.1771664Z 2025-05-07T20:25:57.1771793Z  2025-05-07T20:25:57.1771964Z 
2025-05-07T20:25:57.1771968Z 2025-05-07T20:25:57.1771975Z 2025-05-07T20:25:57.1771985Z 2025-05-07T20:25:57.1771989Z 2025-05-07T20:25:57.1771993Z 2025-05-07T20:25:57.1771996Z 2025-05-07T20:25:57.1772000Z 2025-05-07T20:25:57.1772004Z 2025-05-07T20:25:57.1772007Z 2025-05-07T20:25:57.1772011Z 2025-05-07T20:25:57.1772015Z 2025-05-07T20:25:57.1772150Z  2025-05-07T20:25:57.1772333Z 2025-05-07T20:25:57.1772337Z 2025-05-07T20:25:57.1772340Z 2025-05-07T20:25:57.1772344Z 2025-05-07T20:25:57.1772348Z 2025-05-07T20:25:57.1772351Z 2025-05-07T20:25:57.1772355Z 2025-05-07T20:25:57.1772358Z 2025-05-07T20:25:57.1772362Z 2025-05-07T20:25:57.1772366Z 2025-05-07T20:25:57.1772369Z 2025-05-07T20:25:57.1772373Z 2025-05-07T20:25:57.1772377Z 2025-05-07T20:25:57.1772506Z  2025-05-07T20:25:57.1772695Z 2025-05-07T20:25:57.1772699Z 2025-05-07T20:25:57.1772703Z 2025-05-07T20:25:57.1772706Z 2025-05-07T20:25:57.1772710Z 2025-05-07T20:25:57.1772714Z 2025-05-07T20:25:57.1772717Z 2025-05-07T20:25:57.1772725Z 2025-05-07T20:25:57.1772728Z 2025-05-07T20:25:57.1772732Z 2025-05-07T20:25:57.1772736Z 2025-05-07T20:25:57.1772739Z 2025-05-07T20:25:57.1772743Z 2025-05-07T20:25:57.1772746Z 2025-05-07T20:25:57.1772889Z  2025-05-07T20:25:57.1773150Z 2025-05-07T20:25:57.1773155Z 2025-05-07T20:25:57.1773161Z 2025-05-07T20:25:57.1773166Z 2025-05-07T20:25:57.1773171Z 2025-05-07T20:25:57.1773176Z 2025-05-07T20:25:57.1773182Z 2025-05-07T20:25:57.1773187Z 2025-05-07T20:25:57.1773192Z 2025-05-07T20:25:57.1773198Z 2025-05-07T20:25:57.1773211Z 2025-05-07T20:25:57.1773216Z 2025-05-07T20:25:57.1773222Z 2025-05-07T20:25:57.1773227Z 2025-05-07T20:25:57.1773231Z 2025-05-07T20:25:57.1773442Z  2025-05-07T20:25:57.1773712Z 2025-05-07T20:25:57.1773717Z 2025-05-07T20:25:57.1773730Z 2025-05-07T20:25:57.1773735Z 2025-05-07T20:25:57.1773740Z 2025-05-07T20:25:57.1773745Z 2025-05-07T20:25:57.1773751Z 2025-05-07T20:25:57.1773932Z 2025-05-07T20:25:57.1773938Z 2025-05-07T20:25:57.1773943Z 2025-05-07T20:25:57.1773949Z 2025-05-07T20:25:57.1773954Z 2025-05-07T20:25:57.1773959Z 2025-05-07T20:25:57.1773964Z 2025-05-07T20:25:57.1773969Z 2025-05-07T20:25:57.1773975Z 2025-05-07T20:25:57.1774280Z  2025-05-07T20:25:57.1774488Z 2025-05-07T20:25:57.1774492Z 2025-05-07T20:25:57.1774496Z 2025-05-07T20:25:57.1774499Z 2025-05-07T20:25:57.1774503Z 2025-05-07T20:25:57.1774507Z 2025-05-07T20:25:57.1774510Z 2025-05-07T20:25:57.1774514Z 2025-05-07T20:25:57.1774518Z 2025-05-07T20:25:57.1774521Z 2025-05-07T20:25:57.1774525Z 2025-05-07T20:25:57.1774529Z 2025-05-07T20:25:57.1774532Z 2025-05-07T20:25:57.1774536Z 2025-05-07T20:25:57.1774540Z 2025-05-07T20:25:57.1774543Z 2025-05-07T20:25:57.1774547Z 2025-05-07T20:25:57.1774705Z  2025-05-07T20:25:57.1774906Z 2025-05-07T20:25:57.1774909Z 2025-05-07T20:25:57.1774913Z 2025-05-07T20:25:57.1774923Z 2025-05-07T20:25:57.1774926Z 2025-05-07T20:25:57.1774930Z 2025-05-07T20:25:57.1774934Z 2025-05-07T20:25:57.1774947Z 2025-05-07T20:25:57.1774953Z 2025-05-07T20:25:57.1774958Z 2025-05-07T20:25:57.1774962Z 2025-05-07T20:25:57.1774965Z 2025-05-07T20:25:57.1774975Z 2025-05-07T20:25:57.1774978Z 2025-05-07T20:25:57.1774982Z 2025-05-07T20:25:57.1774986Z 2025-05-07T20:25:57.1774989Z 2025-05-07T20:25:57.1774993Z 2025-05-07T20:25:57.1775155Z  2025-05-07T20:25:57.1775365Z 2025-05-07T20:25:57.1775368Z 2025-05-07T20:25:57.1775467Z  2025-05-07T20:25:57.1775569Z 2025-05-07T20:25:57.1775572Z 2025-05-07T20:25:57.1775678Z  2025-05-07T20:25:57.1775781Z 2025-05-07T20:25:57.1775784Z 2025-05-07T20:25:57.1775788Z 2025-05-07T20:25:57.1775897Z  2025-05-07T20:25:57.1776001Z 
2025-05-07T20:25:57.1776005Z 2025-05-07T20:25:57.1776009Z 2025-05-07T20:25:57.1776012Z 2025-05-07T20:25:57.1776113Z  2025-05-07T20:25:57.1776240Z 2025-05-07T20:25:57.1776243Z 2025-05-07T20:25:57.1776247Z 2025-05-07T20:25:57.1776250Z 2025-05-07T20:25:57.1776254Z 2025-05-07T20:25:57.1776361Z  2025-05-07T20:25:57.1776487Z 2025-05-07T20:25:57.1776491Z 2025-05-07T20:25:57.1776494Z 2025-05-07T20:25:57.1776502Z 2025-05-07T20:25:57.1776506Z 2025-05-07T20:25:57.1776509Z 2025-05-07T20:25:57.1776616Z  2025-05-07T20:25:57.1776745Z 2025-05-07T20:25:57.1776749Z 2025-05-07T20:25:57.1776753Z 2025-05-07T20:25:57.1776756Z 2025-05-07T20:25:57.1776760Z 2025-05-07T20:25:57.1776764Z 2025-05-07T20:25:57.1776768Z 2025-05-07T20:25:57.1776878Z  2025-05-07T20:25:57.1777018Z 2025-05-07T20:25:57.1777021Z 2025-05-07T20:25:57.1777025Z 2025-05-07T20:25:57.1777029Z 2025-05-07T20:25:57.1777032Z 2025-05-07T20:25:57.1777036Z 2025-05-07T20:25:57.1777040Z 2025-05-07T20:25:57.1777043Z 2025-05-07T20:25:57.1777158Z  2025-05-07T20:25:57.1777316Z 2025-05-07T20:25:57.1777320Z 2025-05-07T20:25:57.1777328Z 2025-05-07T20:25:57.1777332Z 2025-05-07T20:25:57.1777335Z 2025-05-07T20:25:57.1777339Z 2025-05-07T20:25:57.1777343Z 2025-05-07T20:25:57.1777347Z 2025-05-07T20:25:57.1777350Z 2025-05-07T20:25:57.1777471Z  2025-05-07T20:25:57.1777635Z 2025-05-07T20:25:57.1777638Z 2025-05-07T20:25:57.1777642Z 2025-05-07T20:25:57.1777646Z 2025-05-07T20:25:57.1777649Z 2025-05-07T20:25:57.1777653Z 2025-05-07T20:25:57.1777657Z 2025-05-07T20:25:57.1777660Z 2025-05-07T20:25:57.1777664Z 2025-05-07T20:25:57.1777668Z 2025-05-07T20:25:57.1777789Z  2025-05-07T20:25:57.1777953Z 2025-05-07T20:25:57.1777957Z 2025-05-07T20:25:57.1777960Z 2025-05-07T20:25:57.1777964Z 2025-05-07T20:25:57.1777968Z 2025-05-07T20:25:57.1777971Z 2025-05-07T20:25:57.1777975Z 2025-05-07T20:25:57.1777979Z 2025-05-07T20:25:57.1777982Z 2025-05-07T20:25:57.1777986Z 2025-05-07T20:25:57.1777990Z 2025-05-07T20:25:57.1778121Z  2025-05-07T20:25:57.1778381Z 2025-05-07T20:25:57.1778386Z 2025-05-07T20:25:57.1778389Z 2025-05-07T20:25:57.1778393Z 2025-05-07T20:25:57.1778396Z 2025-05-07T20:25:57.1778400Z 2025-05-07T20:25:57.1778404Z 2025-05-07T20:25:57.1778407Z 2025-05-07T20:25:57.1778411Z 2025-05-07T20:25:57.1778484Z 2025-05-07T20:25:57.1778487Z 2025-05-07T20:25:57.1778491Z 2025-05-07T20:25:57.1778634Z  2025-05-07T20:25:57.1778813Z 2025-05-07T20:25:57.1778817Z 2025-05-07T20:25:57.1778821Z 2025-05-07T20:25:57.1778824Z 2025-05-07T20:25:57.1778828Z 2025-05-07T20:25:57.1778832Z 2025-05-07T20:25:57.1778835Z 2025-05-07T20:25:57.1778839Z 2025-05-07T20:25:57.1778843Z 2025-05-07T20:25:57.1778847Z 2025-05-07T20:25:57.1778856Z 2025-05-07T20:25:57.1778859Z 2025-05-07T20:25:57.1778863Z 2025-05-07T20:25:57.1778994Z  2025-05-07T20:25:57.1779178Z 2025-05-07T20:25:57.1779181Z 2025-05-07T20:25:57.1779185Z 2025-05-07T20:25:57.1779189Z 2025-05-07T20:25:57.1779198Z 2025-05-07T20:25:57.1779207Z 2025-05-07T20:25:57.1779211Z 2025-05-07T20:25:57.1779215Z 2025-05-07T20:25:57.1779218Z 2025-05-07T20:25:57.1779222Z 2025-05-07T20:25:57.1779226Z 2025-05-07T20:25:57.1779230Z 2025-05-07T20:25:57.1779233Z 2025-05-07T20:25:57.1779237Z 2025-05-07T20:25:57.1779380Z  2025-05-07T20:25:57.1779574Z 2025-05-07T20:25:57.1779577Z 2025-05-07T20:25:57.1779581Z 2025-05-07T20:25:57.1779585Z 2025-05-07T20:25:57.1779588Z 2025-05-07T20:25:57.1779592Z 2025-05-07T20:25:57.1779595Z 2025-05-07T20:25:57.1779599Z 2025-05-07T20:25:57.1779603Z 2025-05-07T20:25:57.1779606Z 2025-05-07T20:25:57.1779610Z 2025-05-07T20:25:57.1779613Z 2025-05-07T20:25:57.1779617Z 
2025-05-07T20:25:57.1779621Z 2025-05-07T20:25:57.1779624Z 2025-05-07T20:25:57.1779779Z  2025-05-07T20:25:57.1779970Z 2025-05-07T20:25:57.1779974Z 2025-05-07T20:25:57.1779977Z 2025-05-07T20:25:57.1779981Z 2025-05-07T20:25:57.1779984Z 2025-05-07T20:25:57.1779993Z 2025-05-07T20:25:57.1779997Z 2025-05-07T20:25:57.1780000Z 2025-05-07T20:25:57.1780004Z 2025-05-07T20:25:57.1780007Z 2025-05-07T20:25:57.1780011Z 2025-05-07T20:25:57.1780015Z 2025-05-07T20:25:57.1780018Z 2025-05-07T20:25:57.1780028Z 2025-05-07T20:25:57.1780035Z 2025-05-07T20:25:57.1780039Z 2025-05-07T20:25:57.1780188Z  2025-05-07T20:25:57.1780387Z 2025-05-07T20:25:57.1780391Z 2025-05-07T20:25:57.1780395Z 2025-05-07T20:25:57.1780398Z 2025-05-07T20:25:57.1780402Z 2025-05-07T20:25:57.1780411Z 2025-05-07T20:25:57.1780415Z 2025-05-07T20:25:57.1780419Z 2025-05-07T20:25:57.1780422Z 2025-05-07T20:25:57.1780426Z 2025-05-07T20:25:57.1780430Z 2025-05-07T20:25:57.1780433Z 2025-05-07T20:25:57.1780437Z 2025-05-07T20:25:57.1780440Z 2025-05-07T20:25:57.1780444Z 2025-05-07T20:25:57.1780448Z 2025-05-07T20:25:57.1780451Z 2025-05-07T20:25:57.1780605Z  2025-05-07T20:25:57.1780812Z 2025-05-07T20:25:57.1780820Z 2025-05-07T20:25:57.1780823Z 2025-05-07T20:25:57.1780827Z 2025-05-07T20:25:57.1780831Z 2025-05-07T20:25:57.1780834Z 2025-05-07T20:25:57.1780838Z 2025-05-07T20:25:57.1780842Z 2025-05-07T20:25:57.1780845Z 2025-05-07T20:25:57.1780849Z 2025-05-07T20:25:57.1780855Z 2025-05-07T20:25:57.1780859Z 2025-05-07T20:25:57.1780863Z 2025-05-07T20:25:57.1780866Z 2025-05-07T20:25:57.1780870Z 2025-05-07T20:25:57.1780874Z 2025-05-07T20:25:57.1780877Z 2025-05-07T20:25:57.1780881Z 2025-05-07T20:25:57.1781046Z  2025-05-07T20:25:57.1781250Z 2025-05-07T20:25:57.1781254Z 2025-05-07T20:25:57.1781360Z  2025-05-07T20:25:57.1781459Z 2025-05-07T20:25:57.1781463Z 2025-05-07T20:25:57.1781561Z  2025-05-07T20:25:57.1781666Z 2025-05-07T20:25:57.1781669Z 2025-05-07T20:25:57.1781673Z 2025-05-07T20:25:57.1781772Z  2025-05-07T20:25:57.1781876Z 2025-05-07T20:25:57.1781885Z 2025-05-07T20:25:57.1781889Z 2025-05-07T20:25:57.1781974Z 2025-05-07T20:25:57.1782077Z  2025-05-07T20:25:57.1782190Z 2025-05-07T20:25:57.1782194Z 2025-05-07T20:25:57.1782198Z 2025-05-07T20:25:57.1782201Z 2025-05-07T20:25:57.1782211Z 2025-05-07T20:25:57.1782319Z  2025-05-07T20:25:57.1782444Z 2025-05-07T20:25:57.1782523Z 2025-05-07T20:25:57.1782526Z 2025-05-07T20:25:57.1782530Z 2025-05-07T20:25:57.1782534Z 2025-05-07T20:25:57.1782537Z 2025-05-07T20:25:57.1782674Z  2025-05-07T20:25:57.1782797Z 2025-05-07T20:25:57.1782800Z 2025-05-07T20:25:57.1782804Z 2025-05-07T20:25:57.1782808Z 2025-05-07T20:25:57.1782812Z 2025-05-07T20:25:57.1782821Z 2025-05-07T20:25:57.1782825Z 2025-05-07T20:25:57.1782936Z  2025-05-07T20:25:57.1783070Z 2025-05-07T20:25:57.1783073Z 2025-05-07T20:25:57.1783077Z 2025-05-07T20:25:57.1783080Z 2025-05-07T20:25:57.1783084Z 2025-05-07T20:25:57.1783088Z 2025-05-07T20:25:57.1783098Z 2025-05-07T20:25:57.1783101Z 2025-05-07T20:25:57.1783224Z  2025-05-07T20:25:57.1783369Z 2025-05-07T20:25:57.1783372Z 2025-05-07T20:25:57.1783376Z 2025-05-07T20:25:57.1783380Z 2025-05-07T20:25:57.1783383Z 2025-05-07T20:25:57.1783392Z 2025-05-07T20:25:57.1783396Z 2025-05-07T20:25:57.1783399Z 2025-05-07T20:25:57.1783403Z 2025-05-07T20:25:57.1783533Z  2025-05-07T20:25:57.1783687Z 2025-05-07T20:25:57.1783690Z 2025-05-07T20:25:57.1783694Z 2025-05-07T20:25:57.1783697Z 2025-05-07T20:25:57.1783707Z 2025-05-07T20:25:57.1783710Z 2025-05-07T20:25:57.1783714Z 2025-05-07T20:25:57.1783717Z 2025-05-07T20:25:57.1783721Z 2025-05-07T20:25:57.1783725Z 
2025-05-07T20:25:57.1783848Z  2025-05-07T20:25:57.1784008Z 2025-05-07T20:25:57.1784018Z 2025-05-07T20:25:57.1784021Z 2025-05-07T20:25:57.1784025Z 2025-05-07T20:25:57.1784028Z 2025-05-07T20:25:57.1784032Z 2025-05-07T20:25:57.1784036Z 2025-05-07T20:25:57.1784039Z 2025-05-07T20:25:57.1784043Z 2025-05-07T20:25:57.1784047Z 2025-05-07T20:25:57.1784050Z 2025-05-07T20:25:57.1784183Z  2025-05-07T20:25:57.1784362Z 2025-05-07T20:25:57.1784365Z 2025-05-07T20:25:57.1784369Z 2025-05-07T20:25:57.1784372Z 2025-05-07T20:25:57.1784376Z 2025-05-07T20:25:57.1784380Z 2025-05-07T20:25:57.1784383Z 2025-05-07T20:25:57.1784390Z 2025-05-07T20:25:57.1784394Z 2025-05-07T20:25:57.1784397Z 2025-05-07T20:25:57.1784401Z 2025-05-07T20:25:57.1784405Z 2025-05-07T20:25:57.1784544Z  done 2025-05-07T20:25:57.4984980Z Preparing transaction: \ | / done 2025-05-07T20:25:58.9344690Z Verifying transaction: \ | / - \ | / - \ | / - \ | done 2025-05-07T20:25:59.6813667Z Executing transaction: - \ | / - \ | done 2025-05-07T20:26:02.0232851Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ... 2025-05-07T20:26:02.0233410Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:02.0234226Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:02.0234787Z 2025-05-07T20:26:02.0246693Z 2025-05-07T20:26:02.0247713Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:02.0248456Z 2025-05-07T20:26:02.0258944Z 2025-05-07T20:26:02.0259260Z [INSTALL] Copying nvtx3 headers ... 2025-05-07T20:26:02.0265469Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/ 2025-05-07T20:26:02.0269469Z 2025-05-07T20:26:02.0534480Z 2025-05-07T20:26:02.0540196Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp 
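[NOTE] The fixup above is two idempotent `ln -sf` calls plus a header copy. A minimal standalone sketch of the same pattern follows (editor's illustration, not part of this run; the env prefix is assumed from the paths in the log, and the helper name is hypothetical):
# fix_nvtools_links.sh: hypothetical helper mirroring the [INSTALL] step above
PREFIX="${CONDA_PREFIX:-/home/ec2-user/miniconda/envs/build_binary}"
for libdir in "$PREFIX/lib" "$PREFIX/targets/x86_64-linux/lib"; do
  # ln -sf replaces any existing link in place, so re-running the step is safe
  if [ -e "$libdir/libnvToolsExt.so.1" ]; then
    ln -sf "$libdir/libnvToolsExt.so.1" "$libdir/libnvToolsExt.so"
  fi
done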
2025-05-07T20:26:02.0561493Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:02.0926909Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:03.9759643Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:04.0399627Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:04.4639807Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:04.4993445Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:04.9298308Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:04.9299459Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:07.3750678Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:09.4032334Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:11.4366828Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:11.4367645Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:13.4609461Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:15.3513102Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:15.4143411Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:19.2862200Z /tmp/tmpdfxt0iiv: line 3: clang: command not found
2025-05-07T20:26:19.2863196Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:19.3506307Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:19.3528670Z total 36
2025-05-07T20:26:19.3529072Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:25 .
2025-05-07T20:26:19.3529468Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:24 ..
2025-05-07T20:26:19.3530042Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:26:19.3530984Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:26:19.3531625Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:26:19.3532242Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:19.3532713Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:19.3533157Z -rw-r--r--. 2 ec2-user ec2-user 2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
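[NOTE] `conda env config vars set` persists a variable inside the env itself, so it is re-exported on every activation and inherited by `conda run`; the first ERROR above is what a pre-check produces when the variable is not yet set, since `printenv` exits non-zero for an unset name. A minimal sketch of the set/verify/unset pattern (editor's illustration with a placeholder variable name, not part of this run):
# persist, inspect, verify, and remove an env-scoped variable
conda env config vars set -n build_binary MY_FLAG=1     # store it in the env
conda env config vars list -n build_binary              # show stored vars
conda run -n build_binary printenv MY_FLAG              # verify inside the env
conda env config vars unset -n build_binary MY_FLAG     # remove it again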
2025-05-07T20:26:19.3533653Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:19.3534280Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:19.3552215Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:21.2957310Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:21.2958062Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:21.7187079Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:23.6018989Z -allow-unsupported-compiler
2025-05-07T20:26:23.6640418Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:26:23.6641105Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null
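[NOTE] The command above preprocesses an empty CUDA source (`-E -x cu -` on /dev/null) and forwards `-dM` to the host compiler, so the (truncated) dump that follows lists every predefined macro from glibc, libstdc++, and the CUDA headers alike. In practice the output is usually filtered; a sketch (editor's illustration, not part of this run):
# confirm the toolkit version nvcc reports, using macros visible in the dump below
conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null \
  | grep -E '__CUDA_API_VER|__CUDACC_VER'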
2025-05-07T20:26:25.6049247Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
2025-05-07T20:26:25.6050039Z #define M_PIl 3.141592653589793238462643383279502884L
2025-05-07T20:26:25.6148042Z #define __CUDA_API_VER_MAJOR__ 12
2025-05-07T20:26:25.6217339Z #define __CUDACC_VER_MINOR__ 6
2025-05-07T20:26:25.6227204Z [... several thousand further #define lines from glibc, libstdc++, and the CUDA runtime headers elided ...]
2025-05-07T20:26:25.6227695Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ?
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:25.6228176Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:25.6228438Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:25.6228759Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:25.6229109Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:25.6229396Z #define __USE_ISOC11 1 2025-05-07T20:26:25.6229622Z #define _BSD_SIZE_T_ 2025-05-07T20:26:25.6229918Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:25.6230262Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:25.6230526Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:25.6235809Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:25.6236150Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:25.6236455Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:25.6236776Z #define __THROW throw () 2025-05-07T20:26:25.6237041Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:25.6237348Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:25.6237694Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:25.6238041Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:25.6238302Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:25.6238557Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:25.6238816Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:25.6239067Z #define L_tmpnam 20 2025-05-07T20:26:25.6239284Z #define ___int_wchar_t_h 2025-05-07T20:26:25.6239619Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:25.6239984Z #define isascii(c) __isascii (c) 2025-05-07T20:26:25.6240236Z #define _T_PTRDIFF 2025-05-07T20:26:25.6240531Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:25.6240872Z #define toascii(c) __toascii (c) 2025-05-07T20:26:25.6241121Z #define __GNUC__ 11 2025-05-07T20:26:25.6241370Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:25.6241649Z #define __GXX_RTTI 1 2025-05-07T20:26:25.6241865Z #define __pie__ 2 2025-05-07T20:26:25.6242066Z #define __MMX__ 1 2025-05-07T20:26:25.6242277Z #define __cudaCDP2Malloc 2025-05-07T20:26:25.6242518Z #define __timespec_defined 1 2025-05-07T20:26:25.6242768Z #define L_ctermid 9 2025-05-07T20:26:25.6242987Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:25.6243277Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:25.6243657Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:25.6244017Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:25.6244267Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:25.6244551Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:25.6244849Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:25.6245148Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:25.6245405Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:25.6245831Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:25.6246558Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:25.6247141Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:25.6247432Z #define __USE_SVID 1 2025-05-07T20:26:25.6247677Z #define __constant__ __location__(constant) 2025-05-07T20:26:25.6247974Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:25.6248375Z #define __device__ __location__(device) 2025-05-07T20:26:25.6248697Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:25.6249007Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:25.6249261Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:25.6249669Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:25.6250011Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:25.6250364Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:25.6250638Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:25.6250993Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:25.6251357Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:25.6251749Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:25.6252104Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:25.6252513Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:25.6252815Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:25.6253082Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:25.6253333Z #define NGROUPS_MAX 65536 2025-05-07T20:26:25.6253578Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:25.6253827Z #define __USE_ISOC95 1 2025-05-07T20:26:25.6254043Z #define _TIME_H 1 2025-05-07T20:26:25.6254302Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:25.6254613Z #define __USE_ISOC99 1 2025-05-07T20:26:25.6254924Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:25.6255275Z #define HOST_NAME_MAX 64 2025-05-07T20:26:25.6255515Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:25.6255766Z #define _IOS_ATEND 4 2025-05-07T20:26:25.6255984Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:25.6256300Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:25.6256687Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:25.6257011Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:25.6257282Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:25.6257590Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:25.6257891Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:25.6258135Z #define _STDIO_H 1 2025-05-07T20:26:25.6258519Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:25.6258976Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:25.6259326Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:25.6259689Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:25.6259968Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:25.6260221Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:25.6260481Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:25.6260758Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:25.6261047Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:25.6261350Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:25.6261611Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:25.6261883Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:25.6262169Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:25.6262432Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:25.6262708Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:25.6263049Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:25.6263407Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:25.6263640Z #define __USE_XOPEN 1 2025-05-07T20:26:25.6263867Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:25.6264295Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:25.6264720Z #define __USE_XOPEN2K 1 2025-05-07T20:26:25.6264951Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:25.6265210Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:25.6265491Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:25.6265750Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:25.6266258Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:25.6266857Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:25.6267129Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:25.6267474Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:25.6267848Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:25.6268291Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:25.6268674Z #define __END_NAMESPACE_C99 2025-05-07T20:26:25.6268932Z #define __glibcxx_integral_traps true 2025-05-07T20:26:25.6269204Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:25.6269445Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:25.6269692Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:25.6270043Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:25.6270284Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:25.6270558Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:25.6270845Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:25.6271198Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:25.6271568Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:25.6271838Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:25.6272091Z #define _IO_UNITBUF 020000 2025-05-07T20:26:25.6272330Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:25.6272579Z #define __FD_SETSIZE 1024 2025-05-07T20:26:25.6272831Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:25.6273090Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:25.6273419Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:25.6273763Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:25.6274014Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:25.6274312Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:25.6274617Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:25.6274874Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:25.6275163Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:25.6275484Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:25.6275771Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:25.6276082Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:25.6276358Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:25.6276619Z #define __USE_POSIX199506 1 2025-05-07T20:26:25.6276858Z #define _FEATURES_H 1 2025-05-07T20:26:25.6277087Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:25.6277474Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:25.6277879Z #define __stub_getmsg 2025-05-07T20:26:25.6278103Z #define _IO_FIXED 010000 2025-05-07T20:26:25.6278362Z #define __cpp_lib_addressof_constexpr 201603 
2025-05-07T20:26:25.6278658Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:25.6278914Z #define __stub_setlogin 2025-05-07T20:26:25.6279141Z #define __stub_fattach 2025-05-07T20:26:25.6279368Z #define __cplusplus 201703L 2025-05-07T20:26:25.6279616Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:25.6279886Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:25.6280132Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:25.6280397Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:25.6280869Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:25.6281380Z #define _IO_INTERNAL 010 2025-05-07T20:26:25.6281610Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:25.6281938Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:25.6282284Z #define __dev_t_defined 2025-05-07T20:26:25.6282508Z #define __DEPRECATED 1 2025-05-07T20:26:25.6282726Z #define __S32_TYPE int 2025-05-07T20:26:25.6282967Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:25.6283248Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:25.6283490Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:25.6283735Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:25.6284326Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:25.6284941Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:25.6285400Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:25.6285733Z #define OVERFLOW 3 2025-05-07T20:26:25.6285965Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:25.6286265Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:25.6286551Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.6286996Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:25.6287317Z #define __SSE2_MATH__ 1 2025-05-07T20:26:25.6287546Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:25.6287838Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.6288119Z #define _IO_STDIO_H 2025-05-07T20:26:25.6288349Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:25.6288629Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:25.6288933Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:25.6289221Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:25.6289521Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:25.6289771Z #define __amd64 1 2025-05-07T20:26:25.6289987Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:25.6290243Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:25.6290504Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:25.6290780Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:25.6291075Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:25.6291332Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:25.6291616Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:25.6291869Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:25.6292109Z #define __bounded 2025-05-07T20:26:25.6292333Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:25.6292621Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:25.6292892Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:25.6293154Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:25.6293433Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:25.6293738Z #define __W_STOPCODE(sig) ((sig) << 8 | 0x7f) 2025-05-07T20:26:25.6294155Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:25.6294564Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:25.6294826Z 
#define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:25.6295163Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:25.6295506Z #define STA_PLL 0x0001 2025-05-07T20:26:25.6295739Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:25.6296014Z #define __GNUG__ 11 2025-05-07T20:26:25.6296239Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:25.6296496Z #define _T_WCHAR 2025-05-07T20:26:25.6296724Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:25.6297049Z #define __specialization_static 2025-05-07T20:26:25.6297357Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:25.6297655Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:25.6297911Z #define cudaArraySparse 0x40 2025-05-07T20:26:25.6298173Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:25.6298410Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:25.6298687Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:25.6298988Z #define _WCHAR_T 2025-05-07T20:26:25.6299200Z #define __cudaCDP2Free 2025-05-07T20:26:25.6299829Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:25.6300505Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:25.6300918Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:25.6301352Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:25.6301631Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:25.6301889Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:25.6302218Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:25.6302569Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:25.6302818Z #define __NO_CTYPE 1 2025-05-07T20:26:25.6303039Z #define __stub_bdflush 2025-05-07T20:26:25.6303394Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:25.6304178Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:25.6304489Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:25.6304752Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:25.6305025Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:25.6305447Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:25.6305736Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:25.6306069Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:25.6306410Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:25.6306686Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:25.6306963Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:25.6307314Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:25.6307648Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:25.6307924Z #define _IO_STDIO 040000 2025-05-07T20:26:25.6308248Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:25.6308639Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:25.6308949Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:25.6309242Z #define _PTRDIFF_T 2025-05-07T20:26:25.6309461Z #define _MOVE_H 1 2025-05-07T20:26:25.6309686Z #define __cpp_hex_float 201603L 2025-05-07T20:26:25.6310016Z #define ADJ_TAI 0x0080 2025-05-07T20:26:25.6310252Z #define __ptrvalue 2025-05-07T20:26:25.6310479Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:25.6310731Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:26:25.6311018Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:25.6311310Z #define 
MATH_ERREXCEPT 2 2025-05-07T20:26:25.6311560Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:25.6311846Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:25.6312236Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:25.6312613Z #define __USE_GNU 1 2025-05-07T20:26:25.6312843Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:25.6313117Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:25.6313384Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:25.6313770Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:25.6314151Z #define WEXITED 4 2025-05-07T20:26:25.6314364Z #define _IO_NO_READS 4 2025-05-07T20:26:25.6314668Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:25.6315017Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:25.6315285Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:25.6315581Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:25.6315889Z #define __uid_t_defined 2025-05-07T20:26:25.6316132Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:25.6316425Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:25.6316696Z #define WNOHANG 1 2025-05-07T20:26:25.6316940Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:25.6317241Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:25.6317510Z #define cudaEventDefault 0x00 2025-05-07T20:26:25.6317809Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:25.6318116Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:25.6318349Z #define __x86_64 1 2025-05-07T20:26:25.6318576Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:25.6318966Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:25.6319443Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:25.6319931Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:25.6320350Z #define __PTRDIFF_T 2025-05-07T20:26:25.6320668Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:25.6321044Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:25.6321319Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.6321599Z #define _Mlong_double_ long double 2025-05-07T20:26:25.6321871Z #define __cpp_lambdas 200907L 2025-05-07T20:26:25.6322122Z #define _IO_DEC 020 2025-05-07T20:26:25.6322340Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:25.6322702Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:25.6322995Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:25.6323271Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:25.6323534Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:25.6323835Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:25.6324233Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:25.6324511Z #define _ANSI_STDDEF_H 2025-05-07T20:26:25.6324776Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:25.6325081Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:25.6325450Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:25.6325828Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:25.6326111Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:25.6326393Z #define __cpp_template_auto 201606L 2025-05-07T20:26:25.6326747Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:25.6327115Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:25.6327379Z #define 
__key_t_defined 2025-05-07T20:26:25.6327627Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:25.6327993Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:25.6328450Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:25.6328815Z #define __GNUC_VA_LIST 2025-05-07T20:26:25.6329141Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:25.6329517Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:25.6329770Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:25.6330046Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:25.6330339Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:25.6330578Z #define __WCOREFLAG 0x80 2025-05-07T20:26:25.6330907Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:25.6331236Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:25.6331505Z #define __LP64__ 1 2025-05-07T20:26:25.6331751Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:25.6332067Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:25.6332339Z #define _IO_off64_t __off64_t 2025-05-07T20:26:25.6332594Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:25.6332847Z #define __time_t_defined 1 2025-05-07T20:26:25.6333091Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:25.6333437Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:25.6333797Z #define __USE_UNIX98 1 2025-05-07T20:26:25.6334032Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:25.6334296Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:25.6334561Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:25.6334855Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:25.6335154Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:25.6335412Z #define SEEK_CUR 1 2025-05-07T20:26:25.6335636Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:25.6335893Z #define _ASSERT_H 1 2025-05-07T20:26:25.6336467Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:25.6337089Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:25.6337353Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:25.6337606Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:25.6337873Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:25.6338144Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:25.6338510Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:25.6338910Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:25.6339562Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:25.6340199Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:25.6340490Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:25.6340835Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:25.6341303Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:25.6341568Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:25.6341848Z #define cudaArrayDefault 0x00 2025-05-07T20:26:25.6342125Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:25.6342408Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:25.6342784Z #define TLOSS 5 2025-05-07T20:26:25.6343000Z #define __ssize_t_defined 2025-05-07T20:26:25.6343246Z #define __CUDACC_VER_BUILD__ 85 2025-05-07T20:26:25.6343513Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:25.6343794Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:25.6344077Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:25.6344431Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:25.6344810Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:25.6345084Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:25.6345366Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:25.6345670Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:25.6345963Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:25.6346241Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:25.6346497Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:25.6346823Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:25.6347175Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:25.6347413Z #define __cdecl 2025-05-07T20:26:25.6347644Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:25.6347962Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:25.6348284Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:25.6348534Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:25.6348797Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:25.6349080Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:25.6349345Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:25.6349644Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:25.6350068Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:25.6350474Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:25.6350904Z #define ADJ_NANO 0x2000 2025-05-07T20:26:25.6351203Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:25.6351550Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:25.6351836Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:25.6352089Z #define __FLT_DIG__ 6 2025-05-07T20:26:25.6352432Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:25.6352820Z #define __NO_INLINE__ 1 2025-05-07T20:26:25.6357592Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:25.6357961Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:25.6358214Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:25.6358471Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:25.6358751Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:25.6359016Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:25.6359307Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:25.6359589Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:25.6359958Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 
2025-05-07T20:26:25.6360363Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:25.6360699Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:25.6361040Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:25.6361274Z #define MAX_CANON 255 2025-05-07T20:26:25.6361497Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:25.6361736Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:25.6361995Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:25.6362269Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:25.6362561Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:25.6362848Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:25.6363116Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:25.6363427Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:25.6363725Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:25.6364083Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:25.6364371Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:25.6364650Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:25.6364915Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:25.6365213Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:25.6365570Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:25.6365821Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:25.6366064Z #define _SYS_TYPES_H 1 2025-05-07T20:26:25.6366297Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:25.6366545Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:25.6366784Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:25.6367008Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:25.6367271Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:25.6367550Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:25.6367792Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:25.6368075Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:25.6368339Z #define FP_SUBNORMAL 3 2025-05-07T20:26:25.6368581Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:25.6368852Z #define _INITIALIZER_LIST 2025-05-07T20:26:25.6369093Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:25.6369327Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:25.6369595Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:25.6369879Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:25.6370125Z #define _IO_file_flags _flags 2025-05-07T20:26:25.6370374Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:25.6370612Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:25.6370878Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:25.6371142Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:25.6371401Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:25.6371765Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:25.6372146Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:25.6372442Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:25.6372700Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:25.6372947Z #define _BSD_SOURCE 1 2025-05-07T20:26:25.6373173Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:25.6374010Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template> struct __has_ ##_NTYPE : false_type { }; template struct __has_ ##_NTYPE<_Tp, __void_t> : true_type { }; 2025-05-07T20:26:25.6374833Z #define __catch(X) catch(X) 2025-05-07T20:26:25.6375089Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:25.6375366Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:25.6375633Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:25.6375874Z #define __STRING(x) #x 2025-05-07T20:26:25.6376105Z #define 
__GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:25.6376369Z #define _T_PTRDIFF_ 2025-05-07T20:26:25.6376598Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:25.6376889Z #define cudaEventWaitExternal 0x01 2025-05-07T20:26:25.6377154Z #define __unbounded 2025-05-07T20:26:25.6377382Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.6377668Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:25.6377937Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.6378220Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:25.6378487Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:25.6378774Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:25.6379091Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:25.6379386Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:25.6379653Z #define __managed__ __location__(managed) 2025-05-07T20:26:25.6379943Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:25.6380328Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:25.6380738Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:25.6380990Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:25.6381347Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:25.6381740Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:25.6382071Z #define _SYS_SIZE_T_H 2025-05-07T20:26:25.6382352Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:25.6382677Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:25.6382948Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:25.6383228Z #define _CRTIMP 2025-05-07T20:26:25.6383522Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:25.6383815Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:25.6384132Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:25.6384474Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:25.6384869Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:25.6385174Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:25.6385439Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:25.6385717Z #define __SIZE_T__ 2025-05-07T20:26:25.6385921Z #define __stub_gtty 2025-05-07T20:26:25.6386138Z #define __pid_t_defined 2025-05-07T20:26:25.6386383Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:25.6386685Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.6386986Z #define __glibcxx_function_requires(...) 
2025-05-07T20:26:25.6387263Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:25.6387497Z #define __need_clockid_t 2025-05-07T20:26:25.6387726Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:25.6387983Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:25.6388294Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:25.6388599Z #define _IO_HEX 0100 2025-05-07T20:26:25.6388844Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:25.6389168Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:25.6389468Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:25.6389730Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:25.6390238Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:25.6390665Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:25.6390967Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:25.6391251Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:25.6391354Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:25.6391455Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:25.6391535Z #define __stub_sstk 2025-05-07T20:26:25.6391625Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:25.6391783Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:25.6391866Z #define __wur 2025-05-07T20:26:25.6391983Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:25.6392069Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:25.6392146Z #define _IO_OCT 040 2025-05-07T20:26:25.6392240Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:25.6392325Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:25.6392411Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:25.6392541Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:25.6392627Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:25.6392725Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:25.6392917Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:25.6393008Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:25.6393095Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:25.6393199Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:25.6393285Z #define __off64_t_defined 2025-05-07T20:26:25.6393389Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:25.6393472Z #define __FLT128_DIG__ 33 2025-05-07T20:26:25.6393571Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:25.6393665Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:25.6393746Z #define __INT32_C(c) c 2025-05-07T20:26:25.6393836Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:25.6393941Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:25.6394033Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:25.6394120Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:25.6394206Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:25.6394300Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:25.6394431Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:25.6394615Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:25.6394701Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:25.6394797Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:25.6394888Z #define __have_pthread_attr_t 1 2025-05-07T20:26:25.6394986Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:25.6395286Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:25.6395390Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:25.6395488Z #define __cudaCDP2EventRecord 2025-05-07T20:26:25.6395582Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:25.6395663Z #define 
htole32(x) (x) 2025-05-07T20:26:25.6395913Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:25.6396032Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:25.6396127Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:25.6396283Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:25.6396421Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:25.6396540Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:25.6396677Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:25.6396762Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:25.6396857Z #define cudaArrayLayered 0x01 2025-05-07T20:26:25.6397032Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:25.6397134Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:25.6397227Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:25.6397325Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:25.6397401Z #define unix 1 2025-05-07T20:26:25.6397492Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:25.6397585Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:25.6397675Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:25.6397791Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:25.6397872Z #define __USE_POSIX 1 2025-05-07T20:26:25.6397962Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:25.6398097Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:25.6398182Z #define __THROWNL throw () 2025-05-07T20:26:25.6398269Z #define __cpp_rtti 199711L 2025-05-07T20:26:25.6398371Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:25.6398452Z #define __PMT(args) args 2025-05-07T20:26:25.6398567Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:25.6398709Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:25.6398816Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:25.6398907Z #define _SIZE_T_DECLARED 2025-05-07T20:26:25.6398998Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:25.6399087Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:25.6399477Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:25.6399572Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:25.6399661Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:25.6399753Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:25.6399903Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:25.6399988Z #define _WCHAR_T_H 2025-05-07T20:26:25.6400071Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:25.6400157Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:25.6400241Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:25.6400338Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:25.6400426Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:25.6400515Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:25.6400616Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:25.6400694Z #define __ELF__ 1 2025-05-07T20:26:25.6400793Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:25.6400888Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:25.6400969Z #define STA_INS 0x0010 2025-05-07T20:26:25.6401065Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:25.6401233Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:25.6401326Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:25.6401502Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:25.6401609Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:26:25.6401715Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:25.6401810Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:25.6401909Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:25.6403633Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:25.6404064Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:25.6404225Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:25.6404326Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:25.6404644Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:25.6404772Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:25.6404863Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:25.6404946Z #define __FLT_RADIX__ 2 2025-05-07T20:26:25.6405049Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:25.6405217Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:25.6405307Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:25.6405403Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:25.6405498Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:25.6405590Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:25.6405698Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:25.6405795Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:25.6405877Z #define WORD_BIT 32 2025-05-07T20:26:25.6405963Z #define _IO_USER_BUF 1 2025-05-07T20:26:25.6406053Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:25.6406157Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:25.6406260Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:25.6406357Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:25.6406457Z #define __long_double_t long double 2025-05-07T20:26:25.6406547Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:25.6406633Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:25.6407038Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:25.6407121Z #define __k8 1 2025-05-07T20:26:25.6407311Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:25.6407482Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:25.6407596Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:25.6407694Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:25.6407788Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:25.6407886Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:25.6407980Z #define __blksize_t_defined 2025-05-07T20:26:25.6408070Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:25.6408164Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:25.6408277Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:25.6408368Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:25.6408467Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:25.6408559Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:25.6408653Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:25.6408907Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:25.6409242Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:25.6409345Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:25.6409444Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:25.6409525Z #define SEEK_SET 0 2025-05-07T20:26:25.6409618Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:25.6409709Z #define 
__CUDA_API_VER_MINOR__ 6 2025-05-07T20:26:25.6409905Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:25.6410005Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:25.6410107Z #define __cudaCDP2GetLastError 2025-05-07T20:26:25.6410201Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:25.6410291Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:25.6410774Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:25.6410875Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:25.6410969Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:25.6411067Z #define __stub_sigreturn 2025-05-07T20:26:25.6411298Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:25.6411502Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:25.6411595Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:25.6411692Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:25.6411773Z #define CLOCK_TAI 11 2025-05-07T20:26:25.6411879Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:25.6411965Z #define __restrict_arr 2025-05-07T20:26:25.6412082Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:25.6412220Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:25.6412737Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:25.6412923Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:25.6413009Z #define __USE_MISC 1 2025-05-07T20:26:25.6413109Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:25.6413215Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:25.6413299Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:25.6413391Z #define __LDBL_DIG__ 18 2025-05-07T20:26:25.6413486Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:25.6413588Z #define __malloc_and_calloc_defined 2025-05-07T20:26:25.6413687Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:25.6413788Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:25.6413868Z #define __x86_64__ 1 2025-05-07T20:26:25.6413952Z #define _SIZE_T_ 2025-05-07T20:26:25.6414834Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:25.6414943Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:25.6415036Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:25.6415154Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:25.6415273Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:25.6415365Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:25.6415471Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:25.6415595Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:25.6415731Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:25.6415824Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:25.6416283Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy (__new, __old, __len); })) 
[... remainder of the compiler's preprocessor #define dump elided -- several hundred glibc, libstdc++, and CUDA runtime macros; notable values include __CUDACC__ 1, __NVCC__ 1, __CUDA_ARCH_LIST__ 520, CUDART_VERSION 12060, _GLIBCXX_RELEASE 11, and __GLIBC_MINOR__ 17 ...]

2025-05-07T20:26:25.6664925Z + conda run -n build_binary nvcc --version
2025-05-07T20:26:27.5645739Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:26:27.5646239Z Copyright (c) 2005-2024 NVIDIA Corporation
2025-05-07T20:26:27.5646651Z Built on Tue_Oct_29_23:50:19_PDT_2024
2025-05-07T20:26:27.5646958Z Cuda compilation tools, release 12.6, V12.6.85
2025-05-07T20:26:27.5647315Z Build cuda_12.6.r12.6/compiler.35059454_0
2025-05-07T20:26:27.6282201Z /usr/bin/nvidia-smi
2025-05-07T20:26:27.6287660Z + nvidia-smi
2025-05-07T20:26:27.6460336Z Wed May 7 20:26:27 2025
2025-05-07T20:26:27.6460821Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:27.6461495Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:26:27.6462001Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:27.6462493Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:26:27.6463006Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:26:27.6463436Z |                                         |                        |               MIG M. |
2025-05-07T20:26:27.6463764Z |=========================================+========================+======================|
2025-05-07T20:26:27.6631673Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:26:27.6632249Z |  0%   25C    P8             16W / 300W  |      0MiB / 23028MiB   |      0%      Default |
2025-05-07T20:26:27.6632756Z |                                         |                        |                  N/A |
2025-05-07T20:26:27.6633202Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:27.6637047Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:27.6637689Z | Processes:                                                                              |
2025-05-07T20:26:27.6638233Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:26:27.6638628Z |        ID   ID                                                                    Usage |
2025-05-07T20:26:27.6638970Z |=========================================================================================|
2025-05-07T20:26:27.6641526Z |  No running processes found                                                             |
2025-05-07T20:26:27.6642178Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:27.9327523Z [INSTALL] Successfully installed CUDA 12.6.3
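[NOTE] The two probes above intentionally report different things: nvcc shows the toolkit just installed into the conda env (release 12.6), while the nvidia-smi banner shows the highest CUDA version the 570.133.07 driver can serve (12.8). The setup is consistent as long as the toolkit does not exceed the driver's CUDA version. A minimal bash sketch of that check (illustrative only; not part of setup_env.bash):

    # Illustrative consistency check -- parse both probes, then require
    # toolkit <= driver CUDA using a version-aware sort.
    toolkit=$(nvcc --version | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')        # 12.6
    driver_cuda=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')  # 12.8
    if [ "$(printf '%s\n' "$toolkit" "$driver_cuda" | sort -V | head -n1)" = "$toolkit" ]; then
      echo "[CHECK] Driver CUDA ${driver_cuda} covers toolkit ${toolkit}"
    else
      echo "[CHECK] Toolkit ${toolkit} exceeds driver CUDA ${driver_cuda}" >&2
      exit 1
    fi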
2025-05-07T20:26:27.9381155Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:27.9381711Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:27.9394204Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:27.9394549Z env:
2025-05-07T20:26:27.9394768Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:27.9395059Z   BUILD_ENV: build_binary
2025-05-07T20:26:27.9395304Z   BUILD_TARGET: genai
2025-05-07T20:26:27.9395531Z   BUILD_VARIANT: cuda
2025-05-07T20:26:27.9395762Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:26:27.9396009Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:27.9396337Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:27.9396793Z ##[endgroup]
2025-05-07T20:26:28.2751995Z ################################################################################
2025-05-07T20:26:28.2752372Z # Install PyTorch (PIP)
2025-05-07T20:26:28.2752602Z #
2025-05-07T20:26:28.2768413Z # [2025-05-07T20:26:28.276Z] + install_pytorch_pip build_binary nightly cuda/12.6.3
2025-05-07T20:26:28.2768856Z ################################################################################
2025-05-07T20:26:28.2798261Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:29.2666751Z Channels:
2025-05-07T20:26:29.2666992Z  - conda-forge
2025-05-07T20:26:29.2667213Z Platform: linux-64
2025-05-07T20:26:32.5763424Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:33.3070727Z Solving environment: done
2025-05-07T20:26:33.5296875Z ## Package Plan ##
2025-05-07T20:26:33.5297357Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:33.5297882Z   added / updated specs:
2025-05-07T20:26:33.5298202Z     - numpy
2025-05-07T20:26:33.5298557Z The following packages will be downloaded:
2025-05-07T20:26:33.5299012Z     package                    |            build
2025-05-07T20:26:33.5299428Z     ---------------------------|-----------------
2025-05-07T20:26:33.5299816Z     libblas-3.9.0              |31_h59b9bed_openblas          16 KB  conda-forge
2025-05-07T20:26:33.5300434Z     libcblas-3.9.0             |31_he106b2a_openblas          16 KB  conda-forge
2025-05-07T20:26:33.5300987Z     libgfortran-15.1.0         |      h69a702a_2              34 KB  conda-forge
2025-05-07T20:26:33.5301427Z     libgfortran5-15.1.0        |      hcea5267_2             1.5 MB  conda-forge
2025-05-07T20:26:33.5301887Z     liblapack-3.9.0            |31_h7ac8fdf_openblas          16 KB  conda-forge
2025-05-07T20:26:33.5302361Z     libopenblas-0.3.29         |pthreads_h94d23a6_0          5.6 MB  conda-forge
2025-05-07T20:26:33.5302804Z     numpy-2.0.2                |   py39h9cb892a_1            7.6 MB  conda-forge
2025-05-07T20:26:33.5303192Z     ------------------------------------------------------------
2025-05-07T20:26:33.5303522Z                                            Total:        14.8 MB
2025-05-07T20:26:33.5304160Z The following NEW packages will be INSTALLED:
2025-05-07T20:26:33.5304593Z   libblas            conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:26:33.5305161Z   libcblas           conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:26:33.5314176Z   libgfortran        conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:26:33.5314827Z   libgfortran5       conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:26:33.5315362Z   liblapack          conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:26:33.5315987Z   libopenblas        conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:26:33.5316979Z   numpy              conda-forge/linux-64::numpy-2.0.2-py39h9cb892a_1
2025-05-07T20:26:33.5317473Z Downloading and Extracting Packages: ...working... done
[... interleaved per-package terminal progress bars elided; all seven packages reached 100% ...]
2025-05-07T20:26:34.4449035Z Preparing transaction: done
2025-05-07T20:26:34.6458720Z Verifying transaction: done
2025-05-07T20:26:34.7467800Z Executing transaction: done
2025-05-07T20:26:34.9227885Z ################################################################################
2025-05-07T20:26:34.9228381Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:34.9228799Z #
2025-05-07T20:26:34.9248193Z # [2025-05-07T20:26:34.924Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:26:34.9248873Z ################################################################################
2025-05-07T20:26:34.9265251Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:35.0172924Z [CHECK] Network does not appear to be blocked.
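[NOTE] The "[EXEC] [ATTEMPT 0/3]" prefix above comes from a retry wrapper in setup_env.bash; the wrapper's definition is not shown in this log. A plausible equivalent, with the function name, attempt counting, and sleep interval all assumed:

    # Assumed shape of the retry helper (illustrative, not the actual code):
    # run a command up to max+1 times, pausing briefly between failures.
    exec_with_retries () {
      local max=$1; shift
      local attempt
      for attempt in $(seq 0 "$max"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0
        sleep 2
      done
      echo "[EXEC] Command failed after $((max + 1)) attempts: $*" >&2
      return 1
    }

    # Usage mirroring the network probe above:
    exec_with_retries 3 wget -q --timeout 1 pypi.org -O /dev/null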
2025-05-07T20:26:35.0173878Z ################################################################################
2025-05-07T20:26:35.0174736Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:26:35.0175382Z #
2025-05-07T20:26:35.0192963Z # [2025-05-07T20:26:35.018Z] + __prepare_pip_arguments torch nightly cuda/12.6.3
2025-05-07T20:26:35.0193813Z ################################################################################
2025-05-07T20:26:35.0215381Z [INSTALL] Extracted package (channel, version): (nightly, LATEST)
2025-05-07T20:26:35.0240710Z [INSTALL] Extracted package variant: cu126
2025-05-07T20:26:35.0257297Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:26:35.0258016Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:26:35.0266241Z [INSTALL] Extracted the full PIP package: --pre torch
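[NOTE] The extraction above is mechanical: the CUDA spec "cuda/12.6.3" yields the variant tag "cu126" (major.minor with the dot removed), and channel plus variant select the wheel index. A hedged sketch of the derivation (variable names assumed; the real logic in __prepare_pip_arguments may differ):

    # Assumed derivation of the PIP index URL from the workflow inputs.
    spec="cuda/12.6.3"
    channel="nightly"
    version=${spec#cuda/}                                        # 12.6.3
    variant="cu$(echo "${version}" | cut -d. -f1-2 | tr -d .)"   # cu126
    index_url="https://download.pytorch.org/whl/${channel}/${variant}/"
    echo "${index_url}"   # https://download.pytorch.org/whl/nightly/cu126/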
2025-05-07T20:26:35.0275288Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ...
2025-05-07T20:26:35.0296621Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:27:55.6040625Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:27:55.6041191Z Collecting torch
2025-05-07T20:27:55.6041869Z   Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp39-cp39-manylinux_2_28_x86_64.whl.metadata (30 kB)
2025-05-07T20:27:55.6042770Z Collecting filelock (from torch)
2025-05-07T20:27:55.6043432Z   Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB)
2025-05-07T20:27:55.6044666Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from torch) (4.13.2)
2025-05-07T20:27:55.6045381Z Collecting sympy>=1.13.3 (from torch)
2025-05-07T20:27:55.6045886Z   Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB)
2025-05-07T20:27:55.6046732Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 35.7 MB/s eta 0:00:00
2025-05-07T20:27:55.6047080Z Collecting networkx (from torch)
2025-05-07T20:27:55.6047582Z   Downloading https://download.pytorch.org/whl/nightly/networkx-3.2.1-py3-none-any.whl (1.6 MB)
2025-05-07T20:27:55.6048298Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 18.8 MB/s eta 0:00:00
2025-05-07T20:27:55.6048643Z Collecting jinja2 (from torch)
2025-05-07T20:27:55.6049123Z   Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB)
2025-05-07T20:27:55.6049640Z Collecting fsspec (from torch)
2025-05-07T20:27:55.6050148Z   Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB)
2025-05-07T20:27:55.6050721Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch)
2025-05-07T20:27:55.6051440Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB)
2025-05-07T20:27:55.6052266Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 64.2 MB/s eta 0:00:00
2025-05-07T20:27:55.6052686Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch)
2025-05-07T20:27:55.6053415Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB)
2025-05-07T20:27:55.6054208Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 10.4 MB/s eta 0:00:00
2025-05-07T20:27:55.6054614Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch)
2025-05-07T20:27:55.6055332Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB)
2025-05-07T20:27:55.6056106Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 43.6 MB/s eta 0:00:00
2025-05-07T20:27:55.6056491Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch)
2025-05-07T20:27:55.6057176Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB)
2025-05-07T20:27:55.6057944Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 34.2 MB/s eta 0:00:00
2025-05-07T20:27:55.6058332Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch)
2025-05-07T20:27:55.6059852Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB)
2025-05-07T20:27:55.6060724Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 58.9 MB/s eta 0:00:00
2025-05-07T20:27:55.6061309Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch)
2025-05-07T20:27:55.6061988Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB)
2025-05-07T20:27:55.6062755Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 128.9 MB/s eta 0:00:00
2025-05-07T20:27:55.6063151Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch)
2025-05-07T20:27:55.6063843Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB)
2025-05-07T20:27:55.6064615Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 192.6 MB/s eta 0:00:00
2025-05-07T20:27:55.6065001Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch)
2025-05-07T20:27:55.6065711Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB)
2025-05-07T20:27:55.6066489Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 152.8 MB/s eta 0:00:00
2025-05-07T20:27:55.6066893Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch)
2025-05-07T20:27:55.6067587Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB)
2025-05-07T20:27:55.6068416Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 128.4 MB/s eta 0:00:00
2025-05-07T20:27:55.6068809Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch)
2025-05-07T20:27:55.6069502Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
2025-05-07T20:27:55.6070358Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 163.0 MB/s eta 0:00:00
2025-05-07T20:27:55.6070732Z Collecting nvidia-nccl-cu12==2.26.2 (from torch)
2025-05-07T20:27:55.6071505Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
2025-05-07T20:27:55.6072284Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch)
2025-05-07T20:27:55.6072938Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB)
2025-05-07T20:27:55.6073612Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch)
2025-05-07T20:27:55.6074391Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB)
2025-05-07T20:27:55.6075245Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 149.3 MB/s eta 0:00:00
2025-05-07T20:27:55.6075634Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch)
2025-05-07T20:27:55.6076431Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:27:55.6077244Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch)
2025-05-07T20:27:55.6078063Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:27:55.6079463Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1)
2025-05-07T20:27:55.6080315Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
2025-05-07T20:27:55.6080871Z   Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB)
2025-05-07T20:27:55.6081510Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 56.1 MB/s eta 0:00:00
2025-05-07T20:27:55.6081884Z Collecting MarkupSafe>=2.0 (from jinja2->torch)
2025-05-07T20:27:55.6082668Z   Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
2025-05-07T20:27:55.6083697Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp39-cp39-manylinux_2_28_x86_64.whl (825.5 MB)
2025-05-07T20:27:55.6084505Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.5/825.5 MB 36.7 MB/s eta 0:00:00
2025-05-07T20:27:55.6085274Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB)
2025-05-07T20:27:55.6086128Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 14.1 MB/s eta 0:00:00
2025-05-07T20:27:55.6086876Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
2025-05-07T20:27:55.6087719Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 101.7 MB/s eta 0:00:00
2025-05-07T20:27:55.6088514Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.4 MB)
2025-05-07T20:27:55.6089380Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.4/153.4 MB 131.5 MB/s eta 0:00:00
2025-05-07T20:27:55.6091130Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:27:55.6094765Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.2.1 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126
nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:27:55.6096826Z 2025-05-07T20:27:57.8298594Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:27:57.8300944Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:28:01.2209390Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:04.6348408Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:04.6348880Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:07.9653180Z True 2025-05-07T20:28:07.9653411Z True 2025-05-07T20:28:07.9653510Z 2025-05-07T20:28:08.0266318Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:08.0302968Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:08.0303576Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:08.0317989Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:08.0318507Z env: 2025-05-07T20:28:08.0318732Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:08.0319026Z BUILD_ENV: build_binary 2025-05-07T20:28:08.0319271Z BUILD_TARGET: genai 2025-05-07T20:28:08.0319498Z BUILD_VARIANT: cuda 2025-05-07T20:28:08.0319738Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:08.0319984Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:08.0320285Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:08.0320623Z ##[endgroup] 2025-05-07T20:28:08.3660200Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:08.3661736Z ################################################################################ 2025-05-07T20:28:08.3662215Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:08.3662577Z # 2025-05-07T20:28:08.3677266Z # [2025-05-07T20:28:08.367Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:08.3677667Z ################################################################################ 2025-05-07T20:28:08.3677897Z 2025-05-07T20:28:08.3693092Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:08.4599569Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:08.4609577Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:08.4610197Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:08.4610591Z 2025-05-07T20:28:08.5510538Z 2025-05-07T20:28:08.5511232Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:08.5534687Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:14.8470830Z Collecting environment information... 
2025-05-07T20:28:14.8471187Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:14.8471472Z Is debug build: False 2025-05-07T20:28:14.8471823Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:14.8472182Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:14.8472354Z 2025-05-07T20:28:14.8472468Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:14.8472776Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:14.8473090Z Clang version: Could not collect 2025-05-07T20:28:14.8473359Z CMake version: Could not collect 2025-05-07T20:28:14.8473639Z Libc version: glibc-2.34 2025-05-07T20:28:14.8473791Z 2025-05-07T20:28:14.8474090Z Python version: 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:14.8474701Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:14.8475225Z Is CUDA available: True 2025-05-07T20:28:14.8475559Z CUDA runtime version: 12.6.85 2025-05-07T20:28:14.8475818Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:14.8476121Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:14.8476446Z Nvidia driver version: 570.133.07 2025-05-07T20:28:14.8476720Z cuDNN version: Could not collect 2025-05-07T20:28:14.8476981Z HIP runtime version: N/A 2025-05-07T20:28:14.8477229Z MIOpen runtime version: N/A 2025-05-07T20:28:14.8477485Z Is XNNPACK available: True 2025-05-07T20:28:14.8477650Z 2025-05-07T20:28:14.8477752Z CPU: 2025-05-07T20:28:14.8478046Z Architecture: x86_64 2025-05-07T20:28:14.8478519Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:14.8489783Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:14.8490341Z Byte Order: Little Endian 2025-05-07T20:28:14.8490704Z CPU(s): 16 2025-05-07T20:28:14.8490989Z On-line CPU(s) list: 0-15 2025-05-07T20:28:14.8491774Z Vendor ID: AuthenticAMD 2025-05-07T20:28:14.8492252Z Model name: AMD EPYC 7R32 2025-05-07T20:28:14.8492696Z CPU family: 23 2025-05-07T20:28:14.8493045Z Model: 49 2025-05-07T20:28:14.8493504Z Thread(s) per core: 2 2025-05-07T20:28:14.8493779Z Core(s) per socket: 8 2025-05-07T20:28:14.8494043Z Socket(s): 1 2025-05-07T20:28:14.8494304Z Stepping: 0 2025-05-07T20:28:14.8494587Z BogoMIPS: 5600.00 2025-05-07T20:28:14.8496647Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:14.8498691Z Hypervisor vendor: KVM 2025-05-07T20:28:14.8498996Z Virtualization type: full 2025-05-07T20:28:14.8499329Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:14.8499692Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:14.8500041Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:14.8500393Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:14.8500707Z NUMA node(s): 1 2025-05-07T20:28:14.8500989Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:14.8501319Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:14.8501697Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:14.8502051Z Vulnerability L1tf: Not affected 2025-05-07T20:28:14.8502387Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:14.8502734Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:14.8503097Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:14.8503447Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:14.8504302Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:14.8504882Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:14.8505411Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:14.8506084Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:14.8506935Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:14.8507603Z Vulnerability Srbds: Not affected 2025-05-07T20:28:14.8507951Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:14.8508182Z 2025-05-07T20:28:14.8508287Z Versions of relevant libraries: 2025-05-07T20:28:14.8508548Z [pip3] numpy==2.0.2 2025-05-07T20:28:14.8508789Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:14.8509084Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:14.8509387Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:14.8509697Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:14.8510057Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:14.8510347Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:14.8510636Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:14.8510928Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:14.8511231Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:14.8511738Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:14.8512029Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:14.8512331Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:14.8512654Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:14.8512935Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:14.8513360Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:14.8513732Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.8514218Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.8514716Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.8515233Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.8515760Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.8516276Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.8516758Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8517221Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:14.8517693Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:14.8518176Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:14.8518650Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8519103Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:14.8519552Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8519990Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8520460Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.8520938Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:14.8521384Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:28:14.8521845Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:14.8522308Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8522760Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:14.8523208Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8523669Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.8524136Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:14.8524600Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:14.8525079Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8525550Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:14.8526028Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8526500Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:14.8526959Z [conda] numpy 2.0.2 py39h9cb892a_1 conda-forge 2025-05-07T20:28:14.8527409Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:14.8527891Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:14.8528380Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.8528877Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.8529362Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:14.8529925Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:14.8530397Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:14.8530878Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:14.8531447Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:14.8531935Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:14.8532414Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:14.8532944Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:14.8533411Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.8533883Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:14.8534342Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:14.8534606Z 2025-05-07T20:28:14.9161569Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:14.9162127Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:14.9176279Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:14.9176645Z env: 2025-05-07T20:28:14.9176870Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:14.9177173Z BUILD_ENV: build_binary 2025-05-07T20:28:14.9177407Z BUILD_TARGET: genai 2025-05-07T20:28:14.9177764Z BUILD_VARIANT: cuda 2025-05-07T20:28:14.9177999Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:14.9178255Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:14.9178549Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:14.9178880Z ##[endgroup] 2025-05-07T20:28:15.2573357Z ################################################################################ 2025-05-07T20:28:15.2573744Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:15.2573996Z # 2025-05-07T20:28:15.2588299Z # [2025-05-07T20:28:15.258Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:15.2588709Z ################################################################################ 2025-05-07T20:28:15.2588922Z 2025-05-07T20:28:15.2604412Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:15.3516140Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:15.3537123Z [BUILD] Running git submodules update ... 2025-05-07T20:28:15.3558248Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:15.3921971Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:15.3922436Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:15.3922876Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:15.3923256Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:15.3923655Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:15.3924094Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:15.3924499Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:15.3957503Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:15.4500656Z [BUILD] Installing other build dependencies ... 
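[Editor's note] The variant and ABI checks logged in the torch install step above ("correct variant (cu126)", "_GLIBCXX_USE_CXX11_ABI") reduce to a few standard torch calls. A minimal sketch follows; the expected values in the comments are the ones this job logged, and nothing here is FBGEMM-specific. The build-dependency installation output continues below.

    import torch

    # Variant check: the nightly build string should carry the cu126 tag.
    print(torch.__version__)                # 2.8.0.dev20250507+cu126
    assert "cu126" in torch.__version__
    # CUDA version PyTorch was built against, as reported by collect_env above.
    print(torch.version.cuda)               # 12.6
    # The _GLIBCXX_USE_CXX11_ABI probe that printed "True" above.
    print(torch.compiled_with_cxx11_abi())  # True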
2025-05-07T20:28:15.4522950Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:17.9176865Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:17.9295123Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:18.0390948Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:18.0426439Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:18.2963029Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:18.3000367Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:18.4145054Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:18.4178977Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:18.7861561Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:18.7894385Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:18.8521997Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:18.8527137Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:18.9405026Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:18.9438096Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:18.9981028Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 21)) (2.0.2) 2025-05-07T20:28:19.0583664Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:19.0615784Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:19.1876564Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:19.1937147Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:19.3217101Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:19.3268410Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:19.3763406Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:19.4465060Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:19.4497604Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:19.5503122Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:19.5555637Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:19.6737761Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:19.6770134Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:19.7929501Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:19.7960277Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:19.8973557Z Collecting pyproject_hooks (from build->-r requirements.txt (line 
14)) 2025-05-07T20:28:19.9003245Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:20.0505837Z Collecting importlib-metadata>=4.6 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:20.0538100Z Downloading importlib_metadata-8.7.0-py3-none-any.whl.metadata (4.8 kB) 2025-05-07T20:28:20.1744481Z Collecting tomli>=1.1.0 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:20.1779245Z Downloading tomli-2.2.1-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:20.2886034Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:20.2920439Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:20.4564756Z Collecting exceptiongroup>=1.0.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:20.4600615Z Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB) 2025-05-07T20:28:20.5642543Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:20.5680054Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:20.6272040Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:20.6803205Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:20.6834793Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:20.7304705Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:20.7873218Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:20.7903945Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:20.8370526Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:20.9181921Z Collecting zipp>=3.20 (from importlib-metadata>=4.6->build->-r requirements.txt (line 14)) 2025-05-07T20:28:20.9212330Z Downloading zipp-3.21.0-py3-none-any.whl.metadata (3.7 kB) 2025-05-07T20:28:21.0325363Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:21.0356285Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:21.0908777Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:21.1394832Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:21.1859897Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:21.6691715Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 57.7 MB/s eta 0:00:00 2025-05-07T20:28:21.6732894Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:21.7234875Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:21.7757517Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:21.8239889Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:21.8782853Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:21.9256636Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl 
(737 kB) 2025-05-07T20:28:21.9840102Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 737.4/737.4 kB 8.5 MB/s eta 0:00:00 2025-05-07T20:28:21.9882408Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:22.0380882Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:22.0887265Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:22.1374118Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:22.1937325Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:22.2410758Z Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB) 2025-05-07T20:28:22.2899336Z Downloading importlib_metadata-8.7.0-py3-none-any.whl (27 kB) 2025-05-07T20:28:22.3379625Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:22.3881579Z Downloading tomli-2.2.1-py3-none-any.whl (14 kB) 2025-05-07T20:28:22.4355721Z Downloading zipp-3.21.0-py3-none-any.whl (9.6 kB) 2025-05-07T20:28:22.4866813Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:22.5469761Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:22.5989247Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:22.6488967Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:22.8954871Z Installing collected packages: sortedcontainers, zipp, tomli, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, exceptiongroup, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, importlib-metadata, hypothesis, pyre-extensions, build 2025-05-07T20:28:25.3675867Z 2025-05-07T20:28:25.3753197Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 exceptiongroup-1.2.2 hypothesis-6.131.14 importlib-metadata-8.7.0 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 tomli-2.2.1 typing-inspect-0.9.0 zipp-3.21.0 2025-05-07T20:28:25.5582277Z ################################################################################ 2025-05-07T20:28:25.5582652Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:25.5582922Z # 2025-05-07T20:28:25.5599117Z # [2025-05-07T20:28:25.559Z] + install_triton_pip build_binary 2025-05-07T20:28:25.5599564Z ################################################################################ 2025-05-07T20:28:25.5599867Z 2025-05-07T20:28:25.5600121Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:25.5600914Z ################################################################################ 2025-05-07T20:28:25.5601283Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:25.5601607Z # 2025-05-07T20:28:25.5616561Z # [2025-05-07T20:28:25.561Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:25.5617220Z ################################################################################ 2025-05-07T20:28:25.5617442Z 2025-05-07T20:28:25.5632129Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:25.6527225Z [CHECK] Network does not appear to be blocked. 
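[Editor's note] The "[EXEC] [ATTEMPT 0/3]" prefix that recurs throughout this log, including the wget network probe just above, comes from a retry wrapper in .github/scripts/setup_env.bash. The following is a rough Python equivalent of that pattern for illustration only; the helper name and retry delay are assumptions, not the script's actual implementation.

    import subprocess, time

    def exec_with_retries(cmd, max_retries=3, delay_s=5):
        # Mirrors the "[EXEC] [ATTEMPT n/3]" log lines: run, retry on failure.
        for attempt in range(max_retries):
            print(f"[EXEC] [ATTEMPT {attempt}/{max_retries}] + {' '.join(cmd)}")
            if subprocess.run(cmd).returncode == 0:
                return
            time.sleep(delay_s)
        raise RuntimeError(f"failed after {max_retries} attempts: {cmd}")

    # e.g. the network probe above:
    exec_with_retries(["wget", "-q", "--timeout", "1", "pypi.org", "-O", "/dev/null"])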
2025-05-07T20:28:25.6527654Z ################################################################################ 2025-05-07T20:28:25.6527984Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:25.6528260Z # 2025-05-07T20:28:25.6544842Z # [2025-05-07T20:28:25.654Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:25.6545451Z ################################################################################ 2025-05-07T20:28:25.6545669Z 2025-05-07T20:28:25.6592389Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:25.6609100Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:25.6609595Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:25.6618714Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:25.6627877Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:25.6649037Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:32.9647424Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:32.9648760Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:32.9649489Z 2025-05-07T20:28:32.9649714Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:32.9650135Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:32.9650919Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:32.9652119Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.4 MB) 2025-05-07T20:28:32.9653194Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.4/166.4 MB 61.8 MB/s eta 0:00:00 2025-05-07T20:28:32.9653580Z Installing collected packages: pytorch-triton 2025-05-07T20:28:32.9653923Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:32.9654310Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:32.9654734Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:32.9655151Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:32.9655935Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:32.9656196Z 2025-05-07T20:28:35.1755769Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:35.1759099Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:37.3102508Z ################################################################################ 2025-05-07T20:28:37.3103101Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:37.3103469Z ################################################################################ 2025-05-07T20:28:37.3103968Z 2025-05-07T20:28:39.3401693Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:41.5147053Z [CHECK] Python (sub-)package 'skbuild' found ... 
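[Editor's note] The pip resolver warning above is expected here: the job deliberately pins pytorch-triton to nightly/3.2.0+git4b3bb1f8 while the installed torch nightly declares 3.3.0+git96316ce5. A hedged sketch for surfacing that mismatch yourself with only stdlib importlib.metadata:

    from importlib.metadata import requires, version

    installed = version("pytorch-triton")   # 3.2.0+git4b3bb1f8 after the downgrade
    declared = [r for r in (requires("torch") or [])
                if r.startswith("pytorch-triton")]
    print("installed      :", installed)
    print("torch declares :", declared)     # pytorch-triton==3.3.0+git96316ce5; ...
    if not any(installed in r for r in declared):
        print("[WARN] pytorch-triton does not match torch's declared pin")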
2025-05-07T20:28:41.5150430Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:41.5202833Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:41.5203307Z . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:41.5216747Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:41.5217096Z env: 2025-05-07T20:28:41.5217324Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:41.5217613Z BUILD_ENV: build_binary 2025-05-07T20:28:41.5217861Z BUILD_TARGET: genai 2025-05-07T20:28:41.5218090Z BUILD_VARIANT: cuda 2025-05-07T20:28:41.5218320Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:41.5218571Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:41.5218865Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:41.5219214Z ##[endgroup] 2025-05-07T20:28:41.8553391Z ################################################################################ 2025-05-07T20:28:41.8553899Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:41.8554250Z # 2025-05-07T20:28:41.8570001Z # [2025-05-07T20:28:41.856Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.8570633Z ################################################################################ 2025-05-07T20:28:41.8570852Z 2025-05-07T20:28:41.8571204Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.8571953Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.8572281Z 2025-05-07T20:28:41.8691424Z d4ed0368510af43fe003d0e644f3e214a1184cea fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.8693800Z 2025-05-07T20:28:41.8694387Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.8694730Z 2025-05-07T20:28:41.8829441Z 9230f5ec3cd9c0291353aa93f1630c572cadce13d0b83330ed75b92574b61dfc fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.8831701Z 2025-05-07T20:28:41.8832207Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.9061622Z 2025-05-07T20:28:41.9062128Z 9e24861bd267fb8b82804cd45222d975 fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.9064355Z 2025-05-07T20:28:41.9074071Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:41.9095168Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:44.6290076Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:44.6291023Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.0.2) 2025-05-07T20:28:44.6291868Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:28:44.6292306Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:28:44.6292572Z 2025-05-07T20:28:51.4409012Z ################################################################################ 2025-05-07T20:28:51.4409738Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:28:51.4410465Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126
2025-05-07T20:28:51.4411066Z [CHECK] CUDA version reported by PyTorch is: 12.6
2025-05-07T20:28:51.4411368Z [CHECK]
2025-05-07T20:28:51.4411691Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU
2025-05-07T20:28:51.4412184Z [CHECK] package channel, the package may be broken at runtime!!!
2025-05-07T20:28:51.4412568Z ################################################################################
2025-05-07T20:28:51.4412786Z
2025-05-07T20:28:51.4412899Z [INSTALL] Checking imports and symbols ...
2025-05-07T20:28:55.3058194Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:28:59.1830543Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:03.0725566Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:03.0729150Z [CHECK] Printing out the FBGEMM-GPU version ...
2025-05-07T20:29:14.7303157Z ################################################################################
2025-05-07T20:29:14.7303573Z [CHECK] The installed FBGEMM TARGET is: genai
2025-05-07T20:29:14.7304047Z [CHECK] The installed FBGEMM VARIANT is: cuda
2025-05-07T20:29:14.7304380Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7
2025-05-07T20:29:14.7304719Z ################################################################################
2025-05-07T20:29:14.7304937Z
2025-05-07T20:29:22.4900624Z ################################################################################
2025-05-07T20:29:22.4901020Z [CHECK] FBGEMM_GPU Experimental Packages
2025-05-07T20:29:22.4902401Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils']
2025-05-07T20:29:22.4904175Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
2025-05-07T20:29:22.4904694Z ################################################################################
2025-05-07T20:29:22.4904911Z
2025-05-07T20:29:22.4905070Z [INSTALL] Check for installation of Python sources ...
2025-05-07T20:29:26.3713128Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ...
2025-05-07T20:29:30.2458119Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ...
2025-05-07T20:29:34.2591676Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ...
2025-05-07T20:29:38.1647329Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ...
2025-05-07T20:29:38.1651567Z [INSTALL] Check for operator registrations ...
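[Editor's note] Before the registration output below: the wheel fingerprinting performed earlier (sha1/sha256/md5 of the artifact before installing it) is a few lines of stdlib Python. A minimal sketch, for any local wheel file:

    import hashlib, pathlib

    whl = pathlib.Path(
        "fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl"
    )
    data = whl.read_bytes()
    for algo in ("sha1", "sha256", "md5"):
        # Same digests the [INSTALL] step printed above.
        print(hashlib.new(algo, data).hexdigest(), whl.name)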
2025-05-07T20:29:41.9794718Z fbgemm.nccl_init 2025-05-07T20:29:41.9794909Z 2025-05-07T20:29:42.0433310Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:29:45.8619385Z fbgemm.gqa_attn_splitk 2025-05-07T20:29:45.8619663Z 2025-05-07T20:29:45.9235509Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:29:49.7571237Z fbgemm.rope_qkv_decoding 2025-05-07T20:29:49.7571449Z 2025-05-07T20:29:49.8185779Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:29:49.8186379Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:29:49.8221668Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:49.8222146Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:49.8236477Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:49.8236833Z env: 2025-05-07T20:29:49.8237239Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:49.8237548Z BUILD_ENV: build_binary 2025-05-07T20:29:49.8237796Z BUILD_TARGET: genai 2025-05-07T20:29:49.8238027Z BUILD_VARIANT: cuda 2025-05-07T20:29:49.8238261Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:49.8238523Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:49.8238829Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:49.8239157Z ##[endgroup] 2025-05-07T20:29:50.1597439Z ################################################################################ 2025-05-07T20:29:50.1597804Z # Test All FBGEMM-GPU Modules 2025-05-07T20:29:50.1598064Z # 2025-05-07T20:29:50.1612919Z # [2025-05-07T20:29:50.160Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:29:50.1613343Z ################################################################################ 2025-05-07T20:29:50.1613556Z 2025-05-07T20:29:57.9302948Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:29:57.9303881Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:29:57.9304385Z [TEST] Determined the test directories: 2025-05-07T20:29:57.9304693Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:29:57.9304997Z fbgemm_gpu/experimental/example/test 2025-05-07T20:29:57.9305295Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:29:57.9305481Z 2025-05-07T20:29:57.9312510Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:29:57.9320378Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:29:57.9320974Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:29:57.9321361Z 2025-05-07T20:29:58.3613271Z 2025-05-07T20:29:58.3613599Z [TEST] Installing PyTest ... 
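[Editor's note] A hedged sketch of the operator-registration probe that produced the [CHECK] lines above: importing fbgemm_gpu registers its custom ops under torch.ops.fbgemm, where they can then be looked up by name. The exact probe lives in the setup scripts; this is an equivalent check, not the script itself. The PyTest installation output continues below.

    import torch
    import fbgemm_gpu  # noqa: F401 -- importing the package registers the ops

    for op_name in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
        # torch.ops.fbgemm raises AttributeError for unknown op names,
        # so hasattr doubles as a registration check.
        assert hasattr(torch.ops.fbgemm, op_name), f"not registered: {op_name}"
        print(f"[CHECK] registered: torch.ops.fbgemm.{op_name}")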
2025-05-07T20:29:58.3636721Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:29:59.4685232Z Channels: 2025-05-07T20:29:59.4685538Z - conda-forge 2025-05-07T20:29:59.4685864Z Platform: linux-64 2025-05-07T20:30:02.6987950Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:03.8460730Z Solving environment: \ | / done 2025-05-07T20:30:04.0755002Z 2025-05-07T20:30:04.0755673Z ## Package Plan ## 2025-05-07T20:30:04.0755914Z 2025-05-07T20:30:04.0756196Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:04.0756598Z 2025-05-07T20:30:04.0756700Z added / updated specs: 2025-05-07T20:30:04.0756949Z - expecttest 2025-05-07T20:30:04.0757166Z - pytest 2025-05-07T20:30:04.0757286Z 2025-05-07T20:30:04.0757290Z 2025-05-07T20:30:04.0757413Z The following packages will be downloaded: 2025-05-07T20:30:04.0757660Z 2025-05-07T20:30:04.0757776Z package | build 2025-05-07T20:30:04.0758100Z ---------------------------|----------------- 2025-05-07T20:30:04.0758616Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:04.0759269Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:04.0759895Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:04.0760345Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:04.0760772Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:04.0761199Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:04.0761611Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:04.0762372Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:04.0762756Z ------------------------------------------------------------ 2025-05-07T20:30:04.0763096Z Total: 428 KB 2025-05-07T20:30:04.0763302Z 2025-05-07T20:30:04.0763438Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:04.0763817Z 2025-05-07T20:30:04.0764022Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:04.0764521Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:04.0765040Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:04.0765513Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:04.0765971Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:04.0766417Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:04.0766849Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:04.0767266Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:04.0767518Z 2025-05-07T20:30:04.0767522Z 2025-05-07T20:30:04.0767526Z 2025-05-07T20:30:04.0767674Z Downloading and Extracting Packages: ...working... 
(conda's interleaved download progress bars elided; all eight packages reached 100%) done
2025-05-07T20:30:04.4691283Z Preparing transaction: done
2025-05-07T20:30:04.5697176Z Verifying transaction: done
2025-05-07T20:30:06.4724621Z Executing transaction: done
2025-05-07T20:30:06.5961073Z [TEST] Checking imports ...
2025-05-07T20:30:10.4505678Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:10.4516905Z [TEST] Setting feature flags ...
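[Editor's note] Both the CUDA_VISIBLE_DEVICES unset earlier and the feature-flag set whose command follows below go through `conda env config vars`, which pins environment variables to the conda environment itself. A small sketch driving the same CLI from Python; the helper function is illustrative, not part of the setup scripts.

    import subprocess

    def conda_env_vars(env_name, action, *args):
        # Wraps `conda env config vars {set,unset} -n <env> ...`.
        cmd = ["conda", "env", "config", "vars", action, "-n", env_name, *args]
        subprocess.run(cmd, check=True)

    conda_env_vars("build_binary", "unset", "CUDA_VISIBLE_DEVICES")
    conda_env_vars("build_binary", "set", "FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1")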
2025-05-07T20:30:10.4517354Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:10.4517692Z 2025-05-07T20:30:10.8794208Z 2025-05-07T20:30:10.8794752Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:10.8795211Z ################################################################################ 2025-05-07T20:30:10.8795519Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:10.8795760Z # 2025-05-07T20:30:10.8813031Z # [2025-05-07T20:30:10.880Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:10.8813443Z ################################################################################ 2025-05-07T20:30:10.8813666Z 2025-05-07T20:30:10.8820341Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:10.8849209Z ./attention/gqa_test.py 2025-05-07T20:30:10.8849504Z ./coalesce/coalesce_test.py 2025-05-07T20:30:10.8849758Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:10.8850037Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:10.8850334Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:10.8850577Z ./moe/activation_test.py 2025-05-07T20:30:10.8850827Z ./moe/gather_scatter_test.py 2025-05-07T20:30:10.8851078Z ./moe/layers_test.py 2025-05-07T20:30:10.8851310Z ./moe/shuffling_test.py 2025-05-07T20:30:10.8851543Z ./quantize/quantize_test.py 2025-05-07T20:30:10.8851709Z 2025-05-07T20:30:10.8851823Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:10.8852032Z 2025-05-07T20:30:10.8869955Z ################################################################################ 2025-05-07T20:30:10.8885193Z # [2025-05-07T20:30:10.888Z] Run Python Test Suite: 2025-05-07T20:30:10.8885512Z # ./attention/gqa_test.py 2025-05-07T20:30:10.8885795Z ################################################################################ 2025-05-07T20:30:10.8910344Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:10.8910953Z 2025-05-07T20:30:13.4277109Z ============================= test session starts ============================== 2025-05-07T20:30:13.4277966Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:13.4278500Z cachedir: .pytest_cache 2025-05-07T20:30:13.4279083Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:13.4280075Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:13.4280510Z plugins: hypothesis-6.131.14 2025-05-07T20:30:14.9426489Z collecting ... 
collected 2 items
2025-05-07T20:30:51.7868324Z attention/gqa_test.py::Int4GQATest::test_gqa
(Hypothesis "Trying example: test_gqa(...)" output condensed; the self=<Int4GQATest ...> test-case repr was stripped during log capture. Examples tried, as (int4_kv, num_groups, B, MAX_T, N_H_L):)
  (False, 1,   1,   4,   1)   (True,  1,   1,   4,   1)   (True,  4,  23,  33,  68)
  (True,  4,  77,   4,   1)   (True,  4,  77,  52,  67)   (False, 4,  57,  45, 120)
  (True,  4,  52,  42,  53)   (True,  1,  77,  95,  53)   (True,  4, 113,  48,  96)
  (False, 1,  51,  61,  69)   (False, 4,  17, 113,  65)   (False, 4,  17,  65,  65)
  (False, 4,  65,  65,  65)   (False, 1,   6, 108,  14)   (False, 1,   6,  14,  14)
  (False, 1,   6,   6,  14)   (False, 1,   6,   6,   6)   (False, 1,  70,  94,  78)
  (False, 1,  78,  94,  78)   (False, 1,  94,  94,  78)   (False, 1,  94,  94,  94)
  (False, 4,  41, 105, 126)   (False, 4, 105, 105, 126)   (False, 4, 105, 105, 105)
  (True,  1,  95, 114,  43)   (True,  1,  43, 114,  43)   (True,  1,  43,  43,  43)
  (False, 1,  21,  38,  42)   (False, 1,  38,  38,  42)   (False, 1,  38,  42,  42)
  (False, 1,  42,  42,  42)   (True,  1,  74,  20,  15)   (True,  1,  20,  20, [log truncated here])
2025-05-07T20:30:51.7936300Z N_H_L=15, 2025-05-07T20:30:51.7936485Z ) 2025-05-07T20:30:51.7936666Z Trying example: test_gqa( 2025-05-07T20:30:51.7936954Z self=, 2025-05-07T20:30:51.7937261Z int4_kv=True, 2025-05-07T20:30:51.7937455Z num_groups=1, 2025-05-07T20:30:51.7937656Z B=20, 2025-05-07T20:30:51.7937848Z MAX_T=15, 2025-05-07T20:30:51.7938032Z N_H_L=15, 2025-05-07T20:30:51.7938214Z ) 2025-05-07T20:30:51.7938407Z Trying example: test_gqa( 2025-05-07T20:30:51.7938687Z self=, 2025-05-07T20:30:51.7938997Z int4_kv=True, 2025-05-07T20:30:51.7939200Z num_groups=1, 2025-05-07T20:30:51.7939408Z B=15, 2025-05-07T20:30:51.7939589Z MAX_T=20, 2025-05-07T20:30:51.7939779Z N_H_L=15, 2025-05-07T20:30:51.7939962Z ) 2025-05-07T20:30:51.7940145Z Trying example: test_gqa( 2025-05-07T20:30:51.7940427Z self=, 2025-05-07T20:30:51.7940758Z int4_kv=True, 2025-05-07T20:30:51.7940954Z num_groups=1, 2025-05-07T20:30:51.7941151Z B=15, 2025-05-07T20:30:51.7941337Z MAX_T=15, 2025-05-07T20:30:51.7941520Z N_H_L=15, 2025-05-07T20:30:51.7941703Z ) 2025-05-07T20:30:51.7941892Z Trying example: test_gqa( 2025-05-07T20:30:51.7942173Z self=, 2025-05-07T20:30:51.7942482Z int4_kv=False, 2025-05-07T20:30:51.7942691Z num_groups=4, 2025-05-07T20:30:51.7942893Z B=117, 2025-05-07T20:30:51.7943073Z MAX_T=104, 2025-05-07T20:30:51.7943263Z N_H_L=69, 2025-05-07T20:30:51.7943449Z ) 2025-05-07T20:30:51.7943640Z Trying example: test_gqa( 2025-05-07T20:30:51.7943933Z self=, 2025-05-07T20:30:51.7944243Z int4_kv=False, 2025-05-07T20:30:51.7944447Z num_groups=4, 2025-05-07T20:30:51.7944656Z B=117, 2025-05-07T20:30:51.7944841Z MAX_T=117, 2025-05-07T20:30:51.7945030Z N_H_L=69, 2025-05-07T20:30:51.7945214Z ) 2025-05-07T20:30:51.7945398Z Trying example: test_gqa( 2025-05-07T20:30:51.7945680Z self=, 2025-05-07T20:30:51.7945985Z int4_kv=False, 2025-05-07T20:30:51.7946193Z num_groups=4, 2025-05-07T20:30:51.7946383Z B=69, 2025-05-07T20:30:51.7946572Z MAX_T=117, 2025-05-07T20:30:51.7946767Z N_H_L=69, 2025-05-07T20:30:51.7946946Z ) 2025-05-07T20:30:51.7947137Z Trying example: test_gqa( 2025-05-07T20:30:51.7947416Z self=, 2025-05-07T20:30:51.7947744Z int4_kv=False, 2025-05-07T20:30:51.7947943Z num_groups=4, 2025-05-07T20:30:51.7948147Z B=117, 2025-05-07T20:30:51.7948331Z MAX_T=69, 2025-05-07T20:30:51.7948521Z N_H_L=69, 2025-05-07T20:30:51.7948713Z ) 2025-05-07T20:30:51.7948898Z PASSED 2025-05-07T20:30:51.8307803Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 
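[Editor's note] The wall of "Trying example: test_gqa(...)" lines above is Hypothesis's verbose mode echoing each drawn parameter set under the 'ci' profile printed at session start (derandomize=True, deadline=None, print_blob=True). The decorator behind test_gqa is not captured in this log, so the sketch below only reconstructs the pattern: the strategies are guesses inferred from the printed examples (boolean int4_kv, num_groups drawn from {1, 4}, small integer B, MAX_T, and N_H_L), mirroring the @given/@settings combination that the log does show verbatim later for test_silu_mul_quant.

    # Hypothetical reconstruction of the parametrization producing the
    # "Trying example" lines; NOT the actual decorator from
    # attention/gqa_test.py, whose source this log does not capture.
    import unittest

    from hypothesis import Verbosity, given, settings
    import hypothesis.strategies as st


    class Int4GQATestSketch(unittest.TestCase):
        @given(
            int4_kv=st.booleans(),
            num_groups=st.sampled_from([1, 4]),
            B=st.integers(min_value=1, max_value=128),
            MAX_T=st.integers(min_value=4, max_value=128),
            N_H_L=st.integers(min_value=1, max_value=128),
        )
        @settings(verbosity=Verbosity.verbose, deadline=None)
        def test_gqa(self, int4_kv, num_groups, B, MAX_T, N_H_L):
            # Verbosity.verbose makes Hypothesis print each drawn example as
            # "Trying example: test_gqa(...)"; the real body exercises the
            # grouped-query attention kernels and is omitted here.
            self.assertGreater(B * MAX_T * N_H_L, 0)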
2025-05-07T20:30:51.8308138Z 2025-05-07T20:30:51.8308287Z =========================== short test summary info ============================ 2025-05-07T20:30:51.8308992Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when CUDA is not available or xformers is not available 2025-05-07T20:30:51.8309685Z ======================== 1 passed, 1 skipped in 38.92s ========================= 2025-05-07T20:30:52.4497665Z 2025-05-07T20:30:52.4498427Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:30:52.4517651Z [TEST] Python test time for ./attention/gqa_test.py: 42 seconds 2025-05-07T20:30:52.4517983Z 2025-05-07T20:30:52.4517987Z 2025-05-07T20:30:52.4517991Z 2025-05-07T20:30:52.4517995Z 2025-05-07T20:30:52.4537858Z ################################################################################ 2025-05-07T20:30:52.4553351Z # [2025-05-07T20:30:52.455Z] Run Python Test Suite: 2025-05-07T20:30:52.4553701Z # ./coalesce/coalesce_test.py 2025-05-07T20:30:52.4553995Z ################################################################################ 2025-05-07T20:30:52.4579315Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:30:52.4579937Z 2025-05-07T20:30:54.6047434Z ============================= test session starts ============================== 2025-05-07T20:30:54.6048529Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:54.6049413Z cachedir: .pytest_cache 2025-05-07T20:30:54.6050382Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:54.6051716Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:54.6052378Z plugins: hypothesis-6.131.14 2025-05-07T20:30:56.1455465Z collecting ... 
collected 1 item 2025-05-07T20:30:56.1455700Z 2025-05-07T20:30:56.8724776Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:30:56.8725287Z 2025-05-07T20:30:56.8725504Z ============================== 1 passed in 2.40s =============================== 2025-05-07T20:30:57.4714656Z 2025-05-07T20:30:57.4715360Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:30:57.4735465Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:30:57.4735906Z 2025-05-07T20:30:57.4735924Z 2025-05-07T20:30:57.4735930Z 2025-05-07T20:30:57.4735935Z 2025-05-07T20:30:57.4757626Z ################################################################################ 2025-05-07T20:30:57.4773020Z # [2025-05-07T20:30:57.476Z] Run Python Test Suite: 2025-05-07T20:30:57.4773502Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:30:57.4773921Z ################################################################################ 2025-05-07T20:30:57.4797076Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:30:57.4797826Z 2025-05-07T20:30:59.6352272Z ============================= test session starts ============================== 2025-05-07T20:30:59.6353136Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:59.6353668Z cachedir: .pytest_cache 2025-05-07T20:30:59.6354271Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:59.6355003Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:59.6355418Z plugins: hypothesis-6.131.14 2025-05-07T20:31:01.2198618Z collecting ... 
collected 5 items 2025-05-07T20:31:01.2198972Z 2025-05-07T20:31:01.2209823Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:01.2218274Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:01.2226151Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:01.2234036Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:01.2251101Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:01.2251569Z 2025-05-07T20:31:01.2252144Z =========================== short test summary info ============================ 2025-05-07T20:31:01.2252824Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.2253744Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.2254804Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.2255728Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.2256643Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.2257284Z ============================== 5 skipped in 1.72s ============================== 2025-05-07T20:31:01.7385554Z 2025-05-07T20:31:01.7386238Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:01.7405522Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds 2025-05-07T20:31:01.7405935Z 2025-05-07T20:31:01.7405941Z 2025-05-07T20:31:01.7405947Z 2025-05-07T20:31:01.7405972Z 2025-05-07T20:31:01.7428015Z ################################################################################ 2025-05-07T20:31:01.7443638Z # [2025-05-07T20:31:01.744Z] Run Python Test Suite: 2025-05-07T20:31:01.7444102Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:01.7444508Z ################################################################################ 2025-05-07T20:31:01.7468458Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:01.7469272Z 2025-05-07T20:31:03.9219433Z ============================= test session starts ============================== 2025-05-07T20:31:03.9220066Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:03.9220587Z cachedir: .pytest_cache 2025-05-07T20:31:03.9221152Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:03.9221880Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:03.9222294Z plugins: hypothesis-6.131.14 2025-05-07T20:31:05.5765890Z collecting ... 
collected 2 items 2025-05-07T20:31:05.5766288Z 2025-05-07T20:31:05.5777379Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:05.5792329Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:05.5792890Z 2025-05-07T20:31:05.5793084Z =========================== short test summary info ============================ 2025-05-07T20:31:05.5793716Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:05.5794551Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:05.5795161Z ============================== 2 skipped in 1.79s ============================== 2025-05-07T20:31:06.1092211Z 2025-05-07T20:31:06.1093141Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:06.1112726Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:31:06.1113248Z 2025-05-07T20:31:06.1113254Z 2025-05-07T20:31:06.1113260Z 2025-05-07T20:31:06.1113265Z 2025-05-07T20:31:06.1135306Z ################################################################################ 2025-05-07T20:31:06.1150909Z # [2025-05-07T20:31:06.114Z] Run Python Test Suite: 2025-05-07T20:31:06.1151741Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:06.1152128Z ################################################################################ 2025-05-07T20:31:06.1175883Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:06.1176821Z 2025-05-07T20:31:08.2610537Z ============================= test session starts ============================== 2025-05-07T20:31:08.2611192Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:08.2611719Z cachedir: .pytest_cache 2025-05-07T20:31:08.2612304Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:08.2613021Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:08.2613481Z plugins: hypothesis-6.131.14 2025-05-07T20:31:09.8171686Z collecting ... collected 4 items 2025-05-07T20:31:09.8171998Z 2025-05-07T20:31:12.8190840Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
2025-05-07T20:31:12.8355761Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:12.8552547Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:12.8716913Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:12.8717370Z 2025-05-07T20:31:12.8717529Z =========================== short test summary info ============================ 2025-05-07T20:31:12.8718230Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:12.8719135Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when xformers is not available 2025-05-07T20:31:12.8719759Z ============================== 4 skipped in 4.74s ============================== 2025-05-07T20:31:14.4971109Z 2025-05-07T20:31:14.4971639Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:14.4988521Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:14.4988816Z 2025-05-07T20:31:14.4988878Z 2025-05-07T20:31:14.4989037Z 2025-05-07T20:31:14.4989048Z 2025-05-07T20:31:14.5011315Z ################################################################################ 2025-05-07T20:31:14.5028016Z # [2025-05-07T20:31:14.502Z] Run Python Test Suite: 2025-05-07T20:31:14.5028358Z # ./moe/activation_test.py 2025-05-07T20:31:14.5028643Z ################################################################################ 2025-05-07T20:31:14.5054506Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:14.5055122Z 2025-05-07T20:31:16.6631304Z ============================= test session starts ============================== 2025-05-07T20:31:16.6631932Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:16.6632450Z cachedir: .pytest_cache 2025-05-07T20:31:16.6633014Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:16.6633748Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:16.6634155Z plugins: hypothesis-6.131.14 2025-05-07T20:31:18.3047011Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:18.5181567Z collecting ... 
collected 2 items 2025-05-07T20:31:18.5181792Z 2025-05-07T20:31:24.4462540Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:31:24.4464179Z self=, 2025-05-07T20:31:24.4465480Z T=1, 2025-05-07T20:31:24.4465930Z D=5120, 2025-05-07T20:31:24.4466474Z contiguous=True, 2025-05-07T20:31:24.4467003Z compiled=True, 2025-05-07T20:31:24.4467323Z ) 2025-05-07T20:31:24.4467587Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4468042Z self=, 2025-05-07T20:31:24.4468626Z T=4096, 2025-05-07T20:31:24.4468824Z D=5120, 2025-05-07T20:31:24.4469012Z contiguous=True, 2025-05-07T20:31:24.4469224Z compiled=True, 2025-05-07T20:31:24.4469428Z ) 2025-05-07T20:31:24.4469623Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4470107Z self=, 2025-05-07T20:31:24.4470486Z T=4096, 2025-05-07T20:31:24.4470670Z D=7168, 2025-05-07T20:31:24.4470860Z contiguous=False, 2025-05-07T20:31:24.4471086Z compiled=False, 2025-05-07T20:31:24.4471292Z ) 2025-05-07T20:31:24.4471480Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4471859Z self=, 2025-05-07T20:31:24.4472236Z T=4096, 2025-05-07T20:31:24.4472420Z D=5120, 2025-05-07T20:31:24.4472612Z contiguous=False, 2025-05-07T20:31:24.4472835Z compiled=True, 2025-05-07T20:31:24.4473036Z ) 2025-05-07T20:31:24.4473226Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4473606Z self=, 2025-05-07T20:31:24.4473985Z T=1, 2025-05-07T20:31:24.4474161Z D=7168, 2025-05-07T20:31:24.4474357Z contiguous=True, 2025-05-07T20:31:24.4474574Z compiled=True, 2025-05-07T20:31:24.4474767Z ) 2025-05-07T20:31:24.4474960Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4475327Z self=, 2025-05-07T20:31:24.4475690Z T=1, 2025-05-07T20:31:24.4475870Z D=7168, 2025-05-07T20:31:24.4476065Z contiguous=False, 2025-05-07T20:31:24.4476277Z compiled=True, 2025-05-07T20:31:24.4476490Z ) 2025-05-07T20:31:24.4476684Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4477042Z self=, 2025-05-07T20:31:24.4477417Z T=4096, 2025-05-07T20:31:24.4477604Z D=5120, 2025-05-07T20:31:24.4477791Z contiguous=False, 2025-05-07T20:31:24.4478022Z compiled=False, 2025-05-07T20:31:24.4478224Z ) 2025-05-07T20:31:24.4478421Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4478780Z self=, 2025-05-07T20:31:24.4479152Z T=1, 2025-05-07T20:31:24.4479328Z D=7168, 2025-05-07T20:31:24.4479521Z contiguous=True, 2025-05-07T20:31:24.4479743Z compiled=False, 2025-05-07T20:31:24.4479951Z ) 2025-05-07T20:31:24.4480138Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4480505Z self=, 2025-05-07T20:31:24.4480873Z T=2048, 2025-05-07T20:31:24.4481050Z D=5120, 2025-05-07T20:31:24.4481255Z contiguous=True, 2025-05-07T20:31:24.4481475Z compiled=True, 2025-05-07T20:31:24.4481670Z ) 2025-05-07T20:31:24.4481875Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4482242Z self=, 2025-05-07T20:31:24.4482610Z T=2048, 2025-05-07T20:31:24.4482803Z D=7168, 2025-05-07T20:31:24.4483004Z contiguous=True, 2025-05-07T20:31:24.4483217Z compiled=True, 2025-05-07T20:31:24.4483420Z ) 2025-05-07T20:31:24.4483616Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4483974Z self=, 2025-05-07T20:31:24.4484350Z T=2048, 2025-05-07T20:31:24.4484540Z D=7168, 2025-05-07T20:31:24.4484733Z contiguous=True, 2025-05-07T20:31:24.4484948Z compiled=False, 2025-05-07T20:31:24.4485156Z ) 2025-05-07T20:31:24.4485352Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4485715Z self=, 2025-05-07T20:31:24.4486189Z T=128, 2025-05-07T20:31:24.4486374Z D=5120, 2025-05-07T20:31:24.4486561Z contiguous=False, 2025-05-07T20:31:24.4486783Z 
compiled=True, 2025-05-07T20:31:24.4486985Z ) 2025-05-07T20:31:24.4487170Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4487533Z self=, 2025-05-07T20:31:24.4487981Z T=128, 2025-05-07T20:31:24.4488155Z D=5120, 2025-05-07T20:31:24.4488347Z contiguous=True, 2025-05-07T20:31:24.4488566Z compiled=True, 2025-05-07T20:31:24.4488760Z ) 2025-05-07T20:31:24.4488953Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4489316Z self=, 2025-05-07T20:31:24.4489683Z T=16384, 2025-05-07T20:31:24.4489877Z D=5120, 2025-05-07T20:31:24.4490071Z contiguous=False, 2025-05-07T20:31:24.4490287Z compiled=True, 2025-05-07T20:31:24.4490488Z ) 2025-05-07T20:31:24.4490679Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4491051Z self=, 2025-05-07T20:31:24.4491416Z T=16384, 2025-05-07T20:31:24.4491609Z D=5120, 2025-05-07T20:31:24.4491800Z contiguous=False, 2025-05-07T20:31:24.4492018Z compiled=False, 2025-05-07T20:31:24.4492234Z ) 2025-05-07T20:31:24.4492434Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4492802Z self=, 2025-05-07T20:31:24.4493183Z T=128, 2025-05-07T20:31:24.4493376Z D=7168, 2025-05-07T20:31:24.4493562Z contiguous=True, 2025-05-07T20:31:24.4493794Z compiled=False, 2025-05-07T20:31:24.4493996Z ) 2025-05-07T20:31:24.4494187Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4494553Z self=, 2025-05-07T20:31:24.4494926Z T=128, 2025-05-07T20:31:24.4495106Z D=7168, 2025-05-07T20:31:24.4495303Z contiguous=False, 2025-05-07T20:31:24.4495530Z compiled=False, 2025-05-07T20:31:24.4495729Z ) 2025-05-07T20:31:24.4495924Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4496295Z self=, 2025-05-07T20:31:24.4496658Z T=1, 2025-05-07T20:31:24.4496841Z D=5120, 2025-05-07T20:31:24.4497038Z contiguous=False, 2025-05-07T20:31:24.4497266Z compiled=False, 2025-05-07T20:31:24.4497471Z ) 2025-05-07T20:31:24.4497660Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4498023Z self=, 2025-05-07T20:31:24.4498387Z T=1, 2025-05-07T20:31:24.4498572Z D=7168, 2025-05-07T20:31:24.4498760Z contiguous=False, 2025-05-07T20:31:24.4498974Z compiled=False, 2025-05-07T20:31:24.4499174Z ) 2025-05-07T20:31:24.4499364Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4499724Z self=, 2025-05-07T20:31:24.4500097Z T=4096, 2025-05-07T20:31:24.4500283Z D=5120, 2025-05-07T20:31:24.4500477Z contiguous=True, 2025-05-07T20:31:24.4500696Z compiled=False, 2025-05-07T20:31:24.4500901Z ) 2025-05-07T20:31:24.4501088Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4501456Z self=, 2025-05-07T20:31:24.4501829Z T=128, 2025-05-07T20:31:24.4502012Z D=7168, 2025-05-07T20:31:24.4502208Z contiguous=True, 2025-05-07T20:31:24.4502429Z compiled=True, 2025-05-07T20:31:24.4502625Z ) 2025-05-07T20:31:24.4502820Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4503184Z self=, 2025-05-07T20:31:24.4503560Z T=1, 2025-05-07T20:31:24.4504075Z D=5120, 2025-05-07T20:31:24.4504295Z contiguous=False, 2025-05-07T20:31:24.4504517Z compiled=True, 2025-05-07T20:31:24.4504710Z ) 2025-05-07T20:31:24.4504902Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4505274Z self=, 2025-05-07T20:31:24.4505789Z T=4096, 2025-05-07T20:31:24.4505984Z D=7168, 2025-05-07T20:31:24.4506173Z contiguous=True, 2025-05-07T20:31:24.4506386Z compiled=False, 2025-05-07T20:31:24.4506591Z ) 2025-05-07T20:31:24.4506786Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4507151Z self=, 2025-05-07T20:31:24.4507641Z T=4096, 2025-05-07T20:31:24.4507828Z D=7168, 2025-05-07T20:31:24.4508023Z contiguous=False, 2025-05-07T20:31:24.4508249Z compiled=True, 2025-05-07T20:31:24.4508454Z ) 
2025-05-07T20:31:24.4508648Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4509020Z self=, 2025-05-07T20:31:24.4509397Z T=128, 2025-05-07T20:31:24.4509582Z D=5120, 2025-05-07T20:31:24.4509771Z contiguous=True, 2025-05-07T20:31:24.4510072Z compiled=False, 2025-05-07T20:31:24.4510280Z ) 2025-05-07T20:31:24.4510471Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4510846Z self=, 2025-05-07T20:31:24.4511218Z T=128, 2025-05-07T20:31:24.4511397Z D=5120, 2025-05-07T20:31:24.4511588Z contiguous=False, 2025-05-07T20:31:24.4511815Z compiled=False, 2025-05-07T20:31:24.4512011Z ) 2025-05-07T20:31:24.4512213Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4512582Z self=, 2025-05-07T20:31:24.4512950Z T=1, 2025-05-07T20:31:24.4513137Z D=5120, 2025-05-07T20:31:24.4513329Z contiguous=True, 2025-05-07T20:31:24.4513547Z compiled=False, 2025-05-07T20:31:24.4513753Z ) 2025-05-07T20:31:24.4513951Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4514314Z self=, 2025-05-07T20:31:24.4514686Z T=2048, 2025-05-07T20:31:24.4514872Z D=7168, 2025-05-07T20:31:24.4515060Z contiguous=False, 2025-05-07T20:31:24.4515284Z compiled=True, 2025-05-07T20:31:24.4515490Z ) 2025-05-07T20:31:24.4515682Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4516049Z self=, 2025-05-07T20:31:24.4516423Z T=2048, 2025-05-07T20:31:24.4516610Z D=7168, 2025-05-07T20:31:24.4516796Z contiguous=False, 2025-05-07T20:31:24.4517029Z compiled=False, 2025-05-07T20:31:24.4517237Z ) 2025-05-07T20:31:24.4517428Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4517798Z self=, 2025-05-07T20:31:24.4518179Z T=16384, 2025-05-07T20:31:24.4518367Z D=7168, 2025-05-07T20:31:24.4518560Z contiguous=False, 2025-05-07T20:31:24.4518784Z compiled=True, 2025-05-07T20:31:24.4518985Z ) 2025-05-07T20:31:24.4519186Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4519552Z self=, 2025-05-07T20:31:24.4519924Z T=16384, 2025-05-07T20:31:24.4520117Z D=7168, 2025-05-07T20:31:24.4520317Z contiguous=True, 2025-05-07T20:31:24.4520528Z compiled=True, 2025-05-07T20:31:24.4520731Z ) 2025-05-07T20:31:24.4520935Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4521298Z self=, 2025-05-07T20:31:24.4521679Z T=4096, 2025-05-07T20:31:24.4521871Z D=7168, 2025-05-07T20:31:24.4522069Z contiguous=True, 2025-05-07T20:31:24.4522279Z compiled=True, 2025-05-07T20:31:24.4522484Z ) 2025-05-07T20:31:24.4522684Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4523049Z self=, 2025-05-07T20:31:24.4523434Z T=2048, 2025-05-07T20:31:24.4523618Z D=5120, 2025-05-07T20:31:24.4523805Z contiguous=False, 2025-05-07T20:31:24.4524034Z compiled=False, 2025-05-07T20:31:24.4524240Z ) 2025-05-07T20:31:24.4524440Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4524951Z self=, 2025-05-07T20:31:24.4525340Z T=2048, 2025-05-07T20:31:24.4525531Z D=5120, 2025-05-07T20:31:24.4525727Z contiguous=True, 2025-05-07T20:31:24.4525958Z compiled=False, 2025-05-07T20:31:24.4526168Z ) 2025-05-07T20:31:24.4526362Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4526742Z self=, 2025-05-07T20:31:24.4527191Z T=128, 2025-05-07T20:31:24.4527376Z D=7168, 2025-05-07T20:31:24.4527581Z contiguous=False, 2025-05-07T20:31:24.4527814Z compiled=True, 2025-05-07T20:31:24.4528023Z ) 2025-05-07T20:31:24.4528224Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4528609Z self=, 2025-05-07T20:31:24.4537641Z T=16384, 2025-05-07T20:31:24.4537951Z D=5120, 2025-05-07T20:31:24.4538233Z contiguous=True, 2025-05-07T20:31:24.4538484Z compiled=True, 2025-05-07T20:31:24.4538690Z ) 2025-05-07T20:31:24.4538897Z Trying example: 
test_silu_mul( 2025-05-07T20:31:24.4539291Z self=, 2025-05-07T20:31:24.4539677Z T=2048, 2025-05-07T20:31:24.4539877Z D=5120, 2025-05-07T20:31:24.4540088Z contiguous=False, 2025-05-07T20:31:24.4540329Z compiled=True, 2025-05-07T20:31:24.4540538Z ) 2025-05-07T20:31:24.4540749Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4541126Z self=, 2025-05-07T20:31:24.4541501Z T=16384, 2025-05-07T20:31:24.4541700Z D=5120, 2025-05-07T20:31:24.4541901Z contiguous=True, 2025-05-07T20:31:24.4542120Z compiled=False, 2025-05-07T20:31:24.4542332Z ) 2025-05-07T20:31:24.4542537Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4542907Z self=, 2025-05-07T20:31:24.4543286Z T=16384, 2025-05-07T20:31:24.4543479Z D=7168, 2025-05-07T20:31:24.4543675Z contiguous=False, 2025-05-07T20:31:24.4543910Z compiled=False, 2025-05-07T20:31:24.4544119Z ) 2025-05-07T20:31:24.4544310Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4544685Z self=, 2025-05-07T20:31:24.4545064Z T=16384, 2025-05-07T20:31:24.4545254Z D=7168, 2025-05-07T20:31:24.4545452Z contiguous=True, 2025-05-07T20:31:24.4545683Z compiled=False, 2025-05-07T20:31:24.4545889Z ) 2025-05-07T20:31:24.4546071Z PASSED 2025-05-07T20:31:24.5151099Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:24.5152321Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:24.5153707Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:24.5155178Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:24.5156578Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:24.5157970Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.5159647Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:24.5161052Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.5162482Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:24.5163893Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:24.5165124Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:24.5166353Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:24.5167398Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:24.5168435Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:24.5169664Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:24.5170955Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:24.5172086Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:24.5173142Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:24.5174326Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:24.5175696Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:24.5176769Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.5177696Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:24.5178457Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:24.5179478Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[identical identify_mutated_tensors warning and traceback repeated at 20:31:24.531869 and 20:31:24.574000]
2025-05-07T20:31:24.5788554Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:24.5789772Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:24.5791173Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:24.5792599Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:24.5793987Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:24.5795362Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.5796672Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:24.5798060Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.5799478Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:24.5800735Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:24.5801960Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:24.5803179Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:24.5805212Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:24.5806283Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:24.5807503Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:24.5808909Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:24.5810028Z W0507
20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:24.5811076Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:24.5812252Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:24.5813622Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:24.5814674Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.5815589Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:24.5816336Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:24.5817362Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:25.0843750Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:25.0844685Z self=, 2025-05-07T20:31:25.0845117Z T=1, 2025-05-07T20:31:25.0845309Z D=5120, 2025-05-07T20:31:25.0845503Z scale_ub=None, 2025-05-07T20:31:25.0845723Z contiguous=True, 2025-05-07T20:31:25.0845949Z compiled=True, 2025-05-07T20:31:25.0846204Z ) 2025-05-07T20:31:25.0846575Z self = 2025-05-07T20:31:25.0847256Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:25.0847520Z 2025-05-07T20:31:25.0847603Z @given( 2025-05-07T20:31:25.0847838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:25.0848173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:25.0848595Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:25.0849036Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:25.0849401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:25.0849816Z ) 2025-05-07T20:31:25.0850291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:25.0850887Z def test_silu_mul_quant( 2025-05-07T20:31:25.0851164Z self, 2025-05-07T20:31:25.0851351Z T: int, 2025-05-07T20:31:25.0851547Z D: int, 2025-05-07T20:31:25.0851761Z scale_ub: Optional[float], 2025-05-07T20:31:25.0852028Z contiguous: bool, 2025-05-07T20:31:25.0852264Z compiled: bool, 2025-05-07T20:31:25.0852491Z ) -> None: 2025-05-07T20:31:25.0852706Z torch.manual_seed(2025) 2025-05-07T20:31:25.0852939Z 2025-05-07T20:31:25.0853524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:25.0853874Z 2025-05-07T20:31:25.0854063Z x_sign = torch.sign(x) 2025-05-07T20:31:25.0854352Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:25.0854662Z x = x_sign * x_clamp 2025-05-07T20:31:25.0854897Z x0 = x[:, :D] 2025-05-07T20:31:25.0855253Z x1 = x[:, D:] 2025-05-07T20:31:25.0855459Z 2025-05-07T20:31:25.0855640Z if contiguous: 2025-05-07T20:31:25.0855876Z x0 = x0.contiguous() 
2025-05-07T20:31:25.0856142Z x1 = x1.contiguous() 2025-05-07T20:31:25.0856375Z 2025-05-07T20:31:25.0856570Z if scale_ub is not None: 2025-05-07T20:31:25.0856846Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:25.0857178Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:25.0857489Z ) 2025-05-07T20:31:25.0857685Z else: 2025-05-07T20:31:25.0857902Z scale_ub_tensor = None 2025-05-07T20:31:25.0858154Z 2025-05-07T20:31:25.0858387Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:25.0858708Z op = silu_mul_quant 2025-05-07T20:31:25.0858953Z if compiled: 2025-05-07T20:31:25.0859207Z op = torch.compile(op) 2025-05-07T20:31:25.0859516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:25.0859787Z 2025-05-07T20:31:25.0859982Z y_fp8, y_scale = fn() 2025-05-07T20:31:25.0860273Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:25.0860564Z 2025-05-07T20:31:25.0860807Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:25.0861147Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:25.0861440Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:25.0861767Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:25.0862123Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:25.0862429Z 2025-05-07T20:31:25.0862632Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:25.0862841Z 2025-05-07T20:31:25.0862946Z moe/activation_test.py:126: 2025-05-07T20:31:25.0863254Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:25.0863580Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:25.0863908Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:25.0864714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:25.0865480Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:25.0866014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:25.0866696Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:25.0867379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:25.0868105Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:25.0868844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:25.0869600Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:25.0870405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:25.0871048Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:25.0871639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:25.0872159Z fn() 2025-05-07T20:31:25.0872654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:25.0873320Z self.fn.run( 2025-05-07T20:31:25.0873791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:25.0874322Z kernel = self.compile( 2025-05-07T20:31:25.0874858Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faba5c74820>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
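[Editor's note] Every failure in this excerpt reduces to the same root cause: Triton refuses to lower fp8e4nv (its E4M3 float8 type, the one torch.float8_e4m3fn maps to) on this GPU, and the error message itself reports that only fp8e4b15 and fp8e5 are available here. To our knowledge the E4M3 variant needs compute capability 8.9 or newer. A minimal sketch of a guard that would turn these hard failures into skips follows; the helper name and the 8.9 threshold are our assumptions, not something the log states:

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Assumption: Triton's fp8e4nv (E4M3) requires sm_89 (Ada) or newer.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests like test_silu_mul_quant:
    requires_fp8_e4m3 = unittest.skipIf(
        not supports_fp8_e4m3(),
        "Triton fp8e4nv (E4M3) is not supported on this GPU architecture",
    )

Applied to the test method above, this would report the cases as skipped on unsupported hardware instead of failing every Hypothesis example.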
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:25.6723129Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:25.6724556Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:25.6725799Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:25.6727021Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:25.6728228Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:25.6729262Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:25.6730285Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:25.6731506Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:25.6732795Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:25.6733906Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:25.6734945Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:25.6736119Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:25.6737498Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:25.6738594Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:25.6739508Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:25.6740250Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:25.6741266Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:25.8776163Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:25.8778244Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:25.8780914Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:25.8783974Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:25.8786714Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:25.8788468Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:25.8789773Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:25.8791266Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:25.8792675Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:25.8793925Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:25.8795151Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:25.8796371Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:25.8797429Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:25.8798486Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:25.8799714Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:25.8800998Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:25.8802111Z W0507 
20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:25.8803162Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:25.8804717Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:25.8806183Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:25.8807420Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:25.8808358Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:25.8809096Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:25.8810222Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.4378888Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:26.4380035Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:26.4381415Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:26.4382875Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:26.4384261Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:26.4385637Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.4386950Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:26.4388379Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.4389907Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:26.4391154Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:26.4392376Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:26.4393573Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:26.4394614Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:26.4395638Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:26.4396863Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:26.4398495Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:26.4399606Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:26.4400649Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:26.4401956Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:26.4403306Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:26.4404654Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.4405560Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:26.4406300Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:26.4407334Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.4774924Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:26.4776286Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:26.4777620Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:26.4779088Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:26.4780465Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:26.4781853Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.4783162Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:26.4784533Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.4785956Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:26.4787220Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:26.4788645Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:26.4789939Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:26.4790965Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:26.4792137Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:26.4793355Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:26.4794633Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:26.4795748Z W0507 
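[Editor's note] The identify_mutated_tensors warning is a symptom of the same root error, not an independent bug: torch.compile lowers a user Triton kernel to TTIR to analyze which arguments it mutates, and when that lowering throws it falls back to conservatively treating every input as mutated. The inner exception here is the same fp8e4nv ValueError. A standalone sketch of ours (not from the log) that reproduces the underlying error without FBGEMM or torch.compile, by compiling a trivial Triton kernel that casts to tl.float8e4nv:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below forces fp8e4nv (E4M3) into the generated TTIR.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda", dtype=torch.float32)
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    # On a GPU without E4M3 support this raises CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...").
    _cast_fp8_kernel[(1,)](x, y, 128, BLOCK=128)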
2025-05-07T20:31:27.2318268Z self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    [test source identical to the listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faba62e1ee0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
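[Editor's note] Both call paths fail the same way: fn() dies compiling _fbgemm_silu_mul_quant, and ref_fn() dies compiling _kernel_quantize_fp8_row inside triton_quantize_fp8_row. For intuition about what the reference path computes, here is a pure-PyTorch sketch of ours: SiLU(x0) * x1 followed by row-wise fp8 quantization. The scale convention is inferred from the test's dequantization step (y_fp8.to(torch.float32) * y_scale[:, None]); the real triton_quantize_fp8_row may differ in details such as epsilon handling:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        x0 = x0.float()
        x1 = x1.float()
        y = x0 * torch.sigmoid(x0) * x1        # SiLU(x0) * x1
        row_max = y.abs().amax(dim=1)          # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max.clamp(min=1e-12) / FP8_MAX   # dequantization scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Round-tripping through this sketch (y_fp8.float() * y_scale[:, None]) recovers y up to E4M3 rounding, which is the property the test asserts against the Triton kernels.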
2025-05-07T20:31:27.2356218Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [test source identical to the listing above]

>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
[... remaining frames identical to the first ref_fn failure above ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:27.2396744Z Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
2025-05-07T20:31:27.6552146Z W0507 20:31:27.651142 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[The traceback is identical to the [1/1] instance shown above; the same warning recurs at 20:31:27.813980, 20:31:28.301030, and 20:31:28.340664 and is elided here.]
2025-05-07T20:31:29.8223452Z self =
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    [test source identical to the listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... remaining frames identical to the fn() failure above ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
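[Editor's note] Both failing kernels hard-code the E4M3 cast, while the error message itself lists fp8e5 (E5M2) as compilable on this architecture. A hypothetical mitigation sketch of ours, selecting an fp8 dtype the current GPU can actually compile; whether E5M2's reduced mantissa precision is acceptable for this quantization is a separate, application-level question:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Assumption: E4M3 (Triton fp8e4nv) needs sm_89 or newer.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn  # maps to Triton fp8e4nv
        return torch.float8_e5m2        # maps to Triton fp8e5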
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:29.8251870Z 2025-05-07T20:31:29.8252290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:29.8252798Z 2025-05-07T20:31:29.8252905Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:29.8253311Z self=, 2025-05-07T20:31:29.8253712Z T=1, 2025-05-07T20:31:29.8253912Z D=7168, 2025-05-07T20:31:29.8254108Z scale_ub=None, 2025-05-07T20:31:29.8254322Z contiguous=True, 2025-05-07T20:31:29.8254540Z compiled=True, 2025-05-07T20:31:29.8254749Z ) 2025-05-07T20:31:29.8255070Z self = 2025-05-07T20:31:29.8255558Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:29.8255819Z 2025-05-07T20:31:29.8255894Z @given( 2025-05-07T20:31:29.8256122Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:29.8256441Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:29.8256741Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:29.8257067Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:29.8257396Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:29.8257675Z ) 2025-05-07T20:31:29.8266929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:29.8267417Z def test_silu_mul_quant( 2025-05-07T20:31:29.8267671Z self, 2025-05-07T20:31:29.8267869Z T: int, 2025-05-07T20:31:29.8268075Z D: int, 2025-05-07T20:31:29.8268416Z scale_ub: Optional[float], 2025-05-07T20:31:29.8268692Z contiguous: bool, 2025-05-07T20:31:29.8268941Z compiled: bool, 2025-05-07T20:31:29.8269172Z ) -> None: 2025-05-07T20:31:29.8269386Z torch.manual_seed(2025) 2025-05-07T20:31:29.8269641Z 2025-05-07T20:31:29.8270089Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:29.8270431Z 2025-05-07T20:31:29.8270634Z x_sign = torch.sign(x) 2025-05-07T20:31:29.8270933Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:29.8271250Z x = x_sign * x_clamp 2025-05-07T20:31:29.8271491Z x0 = x[:, :D] 2025-05-07T20:31:29.8271716Z x1 = x[:, D:] 2025-05-07T20:31:29.8271931Z 2025-05-07T20:31:29.8272117Z if contiguous: 2025-05-07T20:31:29.8272358Z x0 = x0.contiguous() 2025-05-07T20:31:29.8272623Z x1 = x1.contiguous() 2025-05-07T20:31:29.8272860Z 2025-05-07T20:31:29.8273065Z if scale_ub is not None: 2025-05-07T20:31:29.8273349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:29.8273683Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:29.8273998Z ) 2025-05-07T20:31:29.8274200Z else: 2025-05-07T20:31:29.8274411Z scale_ub_tensor = None 2025-05-07T20:31:29.8274672Z 2025-05-07T20:31:29.8274908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:29.8275222Z op = silu_mul_quant 2025-05-07T20:31:29.8275486Z if compiled: 2025-05-07T20:31:29.8275755Z op = torch.compile(op) 2025-05-07T20:31:29.8276058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:29.8276332Z 2025-05-07T20:31:29.8276528Z y_fp8, y_scale = fn() 2025-05-07T20:31:29.8276822Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:29.8277109Z 2025-05-07T20:31:29.8277347Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:29.8277691Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:29.8277990Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:29.8278298Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:29.8278659Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:29.8278983Z 2025-05-07T20:31:29.8279183Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:29.8279386Z 2025-05-07T20:31:29.8279488Z moe/activation_test.py:126: 2025-05-07T20:31:29.8279791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.8280122Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:29.8280456Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:29.8281252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:29.8282031Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:29.8282580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:29.8283280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:29.8283976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:29.8284719Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:29.8285477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:29.8286238Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:29.8286973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:29.8287702Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:29.8288308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:29.8288844Z fn() 2025-05-07T20:31:29.8289345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:29.8290008Z self.fn.run( 2025-05-07T20:31:29.8290475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:29.8291000Z kernel = self.compile( 2025-05-07T20:31:29.8291545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:29.8292208Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:29.8292605Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.8292835Z 2025-05-07T20:31:29.8293049Z self = 2025-05-07T20:31:29.8294146Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:29.8295549Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7faba5c20280>} 2025-05-07T20:31:29.8296904Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:29.8297943Z context = 2025-05-07T20:31:29.8298235Z 2025-05-07T20:31:29.8298404Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:29.8298937Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:29.8299407Z module_map=module_map) 2025-05-07T20:31:29.8299775Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:29.8300134Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:29.8300409Z E ^ 2025-05-07T20:31:29.8300883Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:29.8301346Z 2025-05-07T20:31:29.8301761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:29.8302289Z 2025-05-07T20:31:29.8302396Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:29.8302813Z self=, 2025-05-07T20:31:29.8303213Z T=4096, 2025-05-07T20:31:29.8303408Z D=5120, 2025-05-07T20:31:29.8303608Z scale_ub=None, 2025-05-07T20:31:29.8304231Z contiguous=False, 2025-05-07T20:31:29.8304462Z compiled=False, 2025-05-07T20:31:29.8304674Z ) 2025-05-07T20:31:30.4614829Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:30.4616175Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:30.4617541Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:30.4619066Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:30.4620806Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:30.4622203Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:30.4623669Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:30.4625057Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:30.4626491Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:30.4627751Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 
2025-05-07T20:31:30.4629039Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:30.4630391Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:30.4631440Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:30.4632481Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:30.4633705Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:30.4635009Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:30.4636134Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:30.4637184Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:30.4638467Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:30.4639912Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:30.4640981Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:30.4641897Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:30.4642642Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:30.4643754Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:31.8850000Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:31.8851233Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:31.8852572Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:31.8854008Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:31.8855388Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:31.8856774Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:31.8858085Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:31.8859479Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:31.8860907Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:31.8862323Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:31.8863548Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:31.8864872Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:31.8865913Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:31.8866941Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:31.8868164Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:31.8869461Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:31.8870702Z W0507
20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:31.8871754Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:31.8872941Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:31.8874313Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:31.8875378Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:31.8876298Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:31.8877052Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:31.8878072Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.4648220Z self = 2025-05-07T20:31:35.4649052Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:35.4649440Z 2025-05-07T20:31:35.4649546Z @given( 2025-05-07T20:31:35.4649857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:35.4650253Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:35.4650571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:35.4650921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:35.4651246Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:35.4651538Z ) 2025-05-07T20:31:35.4651893Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:35.4652334Z def test_silu_mul_quant( 2025-05-07T20:31:35.4652585Z self, 2025-05-07T20:31:35.4652791Z T: int, 2025-05-07T20:31:35.4652989Z D: int, 2025-05-07T20:31:35.4653216Z scale_ub: Optional[float], 2025-05-07T20:31:35.4653492Z contiguous: bool, 2025-05-07T20:31:35.4654099Z compiled: bool, 2025-05-07T20:31:35.4654331Z ) -> None: 2025-05-07T20:31:35.4654554Z torch.manual_seed(2025) 2025-05-07T20:31:35.4654801Z 2025-05-07T20:31:35.4655068Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:35.4655412Z 2025-05-07T20:31:35.4655606Z x_sign = torch.sign(x) 2025-05-07T20:31:35.4656046Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:35.4656366Z x = x_sign * x_clamp 2025-05-07T20:31:35.4656611Z x0 = x[:, :D] 2025-05-07T20:31:35.4656824Z x1 = x[:, D:] 2025-05-07T20:31:35.4657041Z 2025-05-07T20:31:35.4657232Z if contiguous: 2025-05-07T20:31:35.4657460Z x0 = x0.contiguous() 2025-05-07T20:31:35.4657725Z x1 = x1.contiguous() 2025-05-07T20:31:35.4657971Z 2025-05-07T20:31:35.4658175Z if scale_ub is not None: 2025-05-07T20:31:35.4658447Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:35.4658796Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:35.4659111Z ) 2025-05-07T20:31:35.4659310Z else: 2025-05-07T20:31:35.4659519Z scale_ub_tensor = None 
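For orientation while reading these repeated traces: ref_fn above computes the eager reference, a SiLU-gated product y = x0 * sigmoid(x0) * x1 in float32, and then quantizes each row to fp8 via triton_quantize_fp8_row, which is why the reference path hits the same Triton error as the kernel under test. A rough pure-PyTorch equivalent of that rowwise quantization (an illustrative approximation with assumed scale and clamping details, not FBGEMM's actual kernel) looks like this:

    from typing import Optional, Tuple
    import torch

    def rowwise_quantize_fp8(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).float().clamp(min=1e-12)
        if scale_ub is not None:
            # scale_ub caps the per-row maximum, as in the scale_ub_tensor branch above.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / fp8_max  # per-row dequantization scale
        y_fp8 = (y.float() / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        # Consistent with the check above: y_fp8.to(torch.float32) * y_scale[:, None] ~= y.
        return y_fp8, y_scale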
2025-05-07T20:31:35.4659775Z 2025-05-07T20:31:35.4660012Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.4660333Z op = silu_mul_quant 2025-05-07T20:31:35.4660589Z if compiled: 2025-05-07T20:31:35.4660845Z op = torch.compile(op) 2025-05-07T20:31:35.4661143Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.4661427Z 2025-05-07T20:31:35.4661625Z > y_fp8, y_scale = fn() 2025-05-07T20:31:35.4661790Z 2025-05-07T20:31:35.4661896Z moe/activation_test.py:117: 2025-05-07T20:31:35.4662198Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.4662535Z moe/activation_test.py:115: in fn 2025-05-07T20:31:35.4662815Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.4663520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:35.4664225Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:35.4664766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:35.4665451Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:35.4666111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:35.4666649Z kernel = self.compile( 2025-05-07T20:31:35.4667191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:35.4667852Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.4668248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.4668477Z 2025-05-07T20:31:35.4668695Z self = 2025-05-07T20:31:35.4669782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:35.4671278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab69195f70>} 2025-05-07T20:31:35.4672620Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:35.4673658Z context = 2025-05-07T20:31:35.4673944Z 2025-05-07T20:31:35.4674425Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:35.4674950Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.4675419Z module_map=module_map) 2025-05-07T20:31:35.4675789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.4676226Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.4676482Z E ^ 2025-05-07T20:31:35.4676964Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.4677414Z 2025-05-07T20:31:35.4677838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:35.4678360Z 2025-05-07T20:31:35.4678464Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:35.4678880Z self=, 2025-05-07T20:31:35.4679282Z T=4096, 2025-05-07T20:31:35.4679479Z D=7168, 2025-05-07T20:31:35.4679669Z scale_ub=None, 2025-05-07T20:31:35.4679889Z contiguous=False, 2025-05-07T20:31:35.4680118Z compiled=False, 2025-05-07T20:31:35.4680324Z ) 2025-05-07T20:31:35.4680644Z self = 2025-05-07T20:31:35.4681149Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:35.4681421Z 2025-05-07T20:31:35.4681504Z @given( 2025-05-07T20:31:35.4681734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:35.4682055Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:35.4682358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:35.4682692Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:35.4683027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:35.4683314Z ) 2025-05-07T20:31:35.4683660Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:35.4684107Z def test_silu_mul_quant( 2025-05-07T20:31:35.4684354Z self, 2025-05-07T20:31:35.4684545Z T: int, 2025-05-07T20:31:35.4684748Z D: int, 2025-05-07T20:31:35.4684970Z scale_ub: Optional[float], 2025-05-07T20:31:35.4685239Z contiguous: bool, 2025-05-07T20:31:35.4685485Z compiled: bool, 2025-05-07T20:31:35.4685716Z ) -> None: 2025-05-07T20:31:35.4685928Z torch.manual_seed(2025) 2025-05-07T20:31:35.4686170Z 2025-05-07T20:31:35.4686445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:35.4686781Z 2025-05-07T20:31:35.4686974Z x_sign = torch.sign(x) 2025-05-07T20:31:35.4687266Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:35.4687568Z x = x_sign * x_clamp 2025-05-07T20:31:35.4687807Z x0 = x[:, :D] 2025-05-07T20:31:35.4688041Z x1 = x[:, D:] 2025-05-07T20:31:35.4688254Z 2025-05-07T20:31:35.4688434Z if contiguous: 2025-05-07T20:31:35.4688673Z x0 = x0.contiguous() 2025-05-07T20:31:35.4688934Z x1 = x1.contiguous() 2025-05-07T20:31:35.4689173Z 2025-05-07T20:31:35.4689369Z if scale_ub is not None: 2025-05-07T20:31:35.4689660Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:35.4690032Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:35.4690347Z ) 2025-05-07T20:31:35.4690548Z else: 2025-05-07T20:31:35.4690759Z scale_ub_tensor = None 2025-05-07T20:31:35.4691006Z 2025-05-07T20:31:35.4691242Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.4691562Z op = silu_mul_quant 2025-05-07T20:31:35.4691807Z if compiled: 2025-05-07T20:31:35.4692057Z op = torch.compile(op) 2025-05-07T20:31:35.4692359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.4692633Z 2025-05-07T20:31:35.4692830Z > y_fp8, y_scale = fn() 2025-05-07T20:31:35.4692993Z 2025-05-07T20:31:35.4693183Z moe/activation_test.py:117: 2025-05-07T20:31:35.4693475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.4693810Z moe/activation_test.py:115: in fn 2025-05-07T20:31:35.4694095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.4694870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:35.4695566Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:35.4696106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:35.4696788Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:35.4697440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:35.4697978Z kernel = self.compile( 2025-05-07T20:31:35.4698525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:35.4699178Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.4699572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.4699814Z 2025-05-07T20:31:35.4700033Z self = 2025-05-07T20:31:35.4701167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:35.4702546Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faba0eadee0>} 2025-05-07T20:31:35.4704071Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:35.4705237Z context = 2025-05-07T20:31:35.4705534Z 2025-05-07T20:31:35.4705701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:35.4706231Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.4706693Z module_map=module_map) 2025-05-07T20:31:35.4707064Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.4707422Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.4707682Z E ^ 2025-05-07T20:31:35.4708148Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.4708605Z 2025-05-07T20:31:35.4709035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:35.4709554Z 2025-05-07T20:31:35.4709662Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:35.4710111Z self=, 2025-05-07T20:31:35.4710512Z T=128, 2025-05-07T20:31:35.4710706Z D=7168, 2025-05-07T20:31:35.4710899Z scale_ub=None, 2025-05-07T20:31:35.4711111Z contiguous=False, 2025-05-07T20:31:35.4711337Z compiled=True, 2025-05-07T20:31:35.4711566Z ) 2025-05-07T20:31:35.5503465Z self = 2025-05-07T20:31:35.5505092Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:35.5505641Z 2025-05-07T20:31:35.5505799Z @given( 2025-05-07T20:31:35.5506264Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:35.5506885Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:35.5507803Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:35.5508468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:35.5509121Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:35.5509698Z ) 2025-05-07T20:31:35.5510124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:35.5518934Z def test_silu_mul_quant( 2025-05-07T20:31:35.5519197Z self, 2025-05-07T20:31:35.5519402Z T: int, 2025-05-07T20:31:35.5519606Z D: int, 2025-05-07T20:31:35.5519826Z scale_ub: Optional[float], 2025-05-07T20:31:35.5520112Z contiguous: bool, 2025-05-07T20:31:35.5520361Z compiled: bool, 2025-05-07T20:31:35.5520596Z ) -> None: 2025-05-07T20:31:35.5520812Z torch.manual_seed(2025) 2025-05-07T20:31:35.5521065Z 2025-05-07T20:31:35.5521353Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:35.5521698Z 2025-05-07T20:31:35.5521901Z x_sign = torch.sign(x) 2025-05-07T20:31:35.5522211Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:35.5522524Z x = x_sign * x_clamp 2025-05-07T20:31:35.5522778Z x0 = x[:, :D] 2025-05-07T20:31:35.5523005Z x1 = x[:, D:] 2025-05-07T20:31:35.5523210Z 2025-05-07T20:31:35.5523405Z if contiguous: 2025-05-07T20:31:35.5523655Z x0 = x0.contiguous() 2025-05-07T20:31:35.5523918Z x1 = x1.contiguous() 2025-05-07T20:31:35.5524171Z 2025-05-07T20:31:35.5524367Z if scale_ub is not None: 2025-05-07T20:31:35.5524639Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:35.5524987Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:35.5525301Z ) 2025-05-07T20:31:35.5525499Z else: 2025-05-07T20:31:35.5525707Z scale_ub_tensor = None 2025-05-07T20:31:35.5525964Z 2025-05-07T20:31:35.5526209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.5526529Z op = silu_mul_quant 2025-05-07T20:31:35.5526785Z if compiled: 2025-05-07T20:31:35.5527044Z op = torch.compile(op) 2025-05-07T20:31:35.5527342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.5527626Z 2025-05-07T20:31:35.5527829Z y_fp8, y_scale = fn() 2025-05-07T20:31:35.5528120Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:35.5528420Z 2025-05-07T20:31:35.5528663Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.5528997Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:35.5529299Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:35.5529629Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:35.5530037Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:35.5530346Z 2025-05-07T20:31:35.5530553Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:35.5530756Z 2025-05-07T20:31:35.5530869Z moe/activation_test.py:126: 2025-05-07T20:31:35.5531167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.5531508Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:35.5531843Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:35.5532655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:35.5533429Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:35.5533996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:35.5534696Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:35.5535388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:35.5536240Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:35.5537008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:35.5537773Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:35.5538508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:35.5539243Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:35.5539856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:35.5540425Z fn() 2025-05-07T20:31:35.5540932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:35.5541509Z self.fn.run( 2025-05-07T20:31:35.5541970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:35.5542505Z kernel = self.compile( 2025-05-07T20:31:35.5543044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:35.5543694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.5544098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.5544340Z 2025-05-07T20:31:35.5544548Z self = 2025-05-07T20:31:35.5545665Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:35.5547072Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7faba1a29700>} 2025-05-07T20:31:35.5548448Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:35.5549493Z context = 2025-05-07T20:31:35.5549789Z 2025-05-07T20:31:35.5550028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:35.5550557Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.5551025Z module_map=module_map) 2025-05-07T20:31:35.5551396Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.5551755Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:35.5552019Z E ^ 2025-05-07T20:31:35.5552492Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.5552958Z 2025-05-07T20:31:35.5553377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:35.5553895Z 2025-05-07T20:31:35.5554004Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:35.5554418Z self=, 2025-05-07T20:31:35.5554829Z T=128, 2025-05-07T20:31:35.5555017Z D=7168, 2025-05-07T20:31:35.5555207Z scale_ub=None, 2025-05-07T20:31:35.5555427Z contiguous=False, 2025-05-07T20:31:35.5555656Z compiled=False, 2025-05-07T20:31:35.5555859Z ) 2025-05-07T20:31:35.8037514Z self = 2025-05-07T20:31:35.8038140Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:35.8038475Z 2025-05-07T20:31:35.8038556Z @given( 2025-05-07T20:31:35.8038790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:35.8039384Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:35.8039696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:35.8040029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:35.8040363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:35.8040645Z ) 2025-05-07T20:31:35.8041125Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:35.8041567Z def test_silu_mul_quant( 2025-05-07T20:31:35.8041810Z self, 2025-05-07T20:31:35.8042012Z T: int, 2025-05-07T20:31:35.8042213Z D: int, 2025-05-07T20:31:35.8042431Z scale_ub: Optional[float], 2025-05-07T20:31:35.8042705Z contiguous: bool, 2025-05-07T20:31:35.8042949Z compiled: bool, 2025-05-07T20:31:35.8043170Z ) -> None: 2025-05-07T20:31:35.8043392Z torch.manual_seed(2025) 2025-05-07T20:31:35.8043638Z 2025-05-07T20:31:35.8043917Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:35.8044261Z 2025-05-07T20:31:35.8044462Z x_sign = torch.sign(x) 2025-05-07T20:31:35.8044755Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:35.8045061Z x = x_sign * x_clamp 2025-05-07T20:31:35.8045304Z x0 = x[:, :D] 2025-05-07T20:31:35.8045529Z x1 = x[:, D:] 2025-05-07T20:31:35.8045732Z 2025-05-07T20:31:35.8045926Z if contiguous: 2025-05-07T20:31:35.8046158Z x0 = x0.contiguous() 2025-05-07T20:31:35.8046417Z x1 = x1.contiguous() 2025-05-07T20:31:35.8046659Z 2025-05-07T20:31:35.8046851Z if scale_ub is not None: 2025-05-07T20:31:35.8047121Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:35.8047463Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:35.8047776Z ) 2025-05-07T20:31:35.8047974Z else: 2025-05-07T20:31:35.8048185Z scale_ub_tensor = None 2025-05-07T20:31:35.8048436Z 2025-05-07T20:31:35.8048670Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.8048986Z op = silu_mul_quant 2025-05-07T20:31:35.8049239Z if compiled: 
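The blocks of warnings from triton_kernel_wrap.py in this log are torch.compile degrading gracefully rather than a separate bug: before wrapping a user-defined Triton kernel, Dynamo lowers it to TTIR so it can determine exactly which tensor arguments the kernel writes to, and when that lowering raises (here, the same fp8e4nv CompilationError) it assumes every input is mutated, which is correct but conservative. The control flow is roughly the following sketch, where _generate_ttir and _mutations_from_ttir are illustrative stand-ins for torch's internal helpers:

    import torch

    def _generate_ttir(kernel, kwargs):
        # Stand-in: on this machine, Triton's frontend raises while building TTIR.
        raise RuntimeError("type fp8e4nv not supported in this architecture")

    def _mutations_from_ttir(ttir_module):
        # Stand-in: the real analysis walks the TTIR to find stored-to pointers.
        return []

    def identify_mutated_tensors_sketch(kernel, kwargs):
        tensor_args = [k for k, v in kwargs.items() if isinstance(v, torch.Tensor)]
        try:
            ttir_module = _generate_ttir(kernel, kwargs)
            return _mutations_from_ttir(ttir_module)  # precise answer when TTIR builds
        except Exception:
            # Logged above as "Encountered an exception in identify_mutated_tensors,
            # assuming every input is mutated": safe, but it blocks some optimizations.
            return tensor_args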
2025-05-07T20:31:35.8049488Z op = torch.compile(op) 2025-05-07T20:31:35.8049781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.8050064Z 2025-05-07T20:31:35.8050257Z > y_fp8, y_scale = fn() 2025-05-07T20:31:35.8050421Z 2025-05-07T20:31:35.8050525Z moe/activation_test.py:117: 2025-05-07T20:31:35.8050822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.8051154Z moe/activation_test.py:115: in fn 2025-05-07T20:31:35.8051432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.8052129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:35.8052820Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:35.8053361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:35.8054035Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:35.8054693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:35.8055226Z kernel = self.compile( 2025-05-07T20:31:35.8055760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:35.8056412Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.8056812Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.8057043Z 2025-05-07T20:31:35.8057258Z self = 2025-05-07T20:31:35.8058422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:35.8059806Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab681d7d30>} 2025-05-07T20:31:35.8061232Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:35.8062256Z context = 2025-05-07T20:31:35.8062546Z 2025-05-07T20:31:35.8062721Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:35.8063237Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.8063713Z module_map=module_map) 2025-05-07T20:31:35.8064085Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.8064438Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.8064706Z E ^ 2025-05-07T20:31:35.8065185Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.8065641Z 2025-05-07T20:31:35.8066061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:35.8066571Z 2025-05-07T20:31:35.8066677Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:35.8067104Z self=, 2025-05-07T20:31:35.8067513Z T=4096, 2025-05-07T20:31:35.8067699Z D=5120, 2025-05-07T20:31:35.8067893Z scale_ub=1200.0, 2025-05-07T20:31:35.8068125Z contiguous=True, 2025-05-07T20:31:35.8068354Z compiled=False, 2025-05-07T20:31:35.8068554Z ) 2025-05-07T20:31:35.8068880Z self = 2025-05-07T20:31:35.8069376Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:35.8069650Z 2025-05-07T20:31:35.8069728Z @given( 2025-05-07T20:31:35.8070033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:35.8070349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:35.8070649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:35.8070976Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:35.8071304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:35.8071583Z ) 2025-05-07T20:31:35.8071932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:35.8072374Z def test_silu_mul_quant( 2025-05-07T20:31:35.8072625Z self, 2025-05-07T20:31:35.8072823Z T: int, 2025-05-07T20:31:35.8073023Z D: int, 2025-05-07T20:31:35.8073246Z scale_ub: Optional[float], 2025-05-07T20:31:35.8073515Z contiguous: bool, 2025-05-07T20:31:35.8073754Z compiled: bool, 2025-05-07T20:31:35.8073978Z ) -> None: 2025-05-07T20:31:35.8074189Z torch.manual_seed(2025) 2025-05-07T20:31:35.8074437Z 2025-05-07T20:31:35.8074711Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:35.8075052Z 2025-05-07T20:31:35.8075251Z x_sign = torch.sign(x) 2025-05-07T20:31:35.8075543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:35.8075847Z x = x_sign * x_clamp 2025-05-07T20:31:35.8076090Z x0 = x[:, :D] 2025-05-07T20:31:35.8076313Z x1 = x[:, D:] 2025-05-07T20:31:35.8076520Z 2025-05-07T20:31:35.8076714Z if contiguous: 2025-05-07T20:31:35.8076951Z x0 = x0.contiguous() 2025-05-07T20:31:35.8077216Z x1 = x1.contiguous() 2025-05-07T20:31:35.8077453Z 2025-05-07T20:31:35.8077651Z if scale_ub is not None: 2025-05-07T20:31:35.8078015Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:35.8078346Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:35.8078663Z ) 2025-05-07T20:31:35.8078859Z else: 2025-05-07T20:31:35.8079073Z scale_ub_tensor = None 2025-05-07T20:31:35.8079328Z 2025-05-07T20:31:35.8079671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.8079982Z op = silu_mul_quant 2025-05-07T20:31:35.8080235Z if compiled: 2025-05-07T20:31:35.8080483Z op = torch.compile(op) 2025-05-07T20:31:35.8080776Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.8081052Z 2025-05-07T20:31:35.8081246Z > y_fp8, y_scale = fn() 2025-05-07T20:31:35.8081410Z 2025-05-07T20:31:35.8081513Z moe/activation_test.py:117: 2025-05-07T20:31:35.8081818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.8082148Z moe/activation_test.py:115: in fn 2025-05-07T20:31:35.8082435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.8083114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:35.8083798Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:35.8084339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:35.8085017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:35.8085669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:35.8086198Z kernel = self.compile( 2025-05-07T20:31:35.8086734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:35.8087383Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.8087774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.8088004Z 2025-05-07T20:31:35.8088208Z self = 2025-05-07T20:31:35.8089286Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:35.8090666Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab69195940>} 2025-05-07T20:31:35.8092002Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:35.8093028Z context = 2025-05-07T20:31:35.8093323Z 2025-05-07T20:31:35.8093488Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:35.8094012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.8094474Z module_map=module_map) 2025-05-07T20:31:35.8094844Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.8095196Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.8095448Z E ^ 2025-05-07T20:31:35.8095916Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:35.8097293Z 
2025-05-07T20:31:35.8097402Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:35.8097900Z     self=,
2025-05-07T20:31:35.8098315Z     T=1,
2025-05-07T20:31:35.8098497Z     D=5120,
2025-05-07T20:31:35.8098685Z     scale_ub=None,
2025-05-07T20:31:35.8098901Z     contiguous=True,
2025-05-07T20:31:35.8099127Z     compiled=True,
2025-05-07T20:31:35.8099330Z )
2025-05-07T20:31:36.4083161Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:36.4085697Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last):
2025-05-07T20:31:36.4088392Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:36.4090624Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:36.4092012Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:36.4093403Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.4094715Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:36.4096091Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.4097512Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:36.4098763Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     generator.visit(fn.parse())
2025-05-07T20:31:36.4099980Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:36.4101194Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ret = super().visit(node)
2025-05-07T20:31:36.4102236Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
2025-05-07T20:31:36.4103256Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return visitor(node)
2025-05-07T20:31:36.4104663Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:36.4105957Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:36.4107071Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
2025-05-07T20:31:36.4108240Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     self.visit(item)
2025-05-07T20:31:36.4109418Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:36.4111230Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:36.4112293Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.4113205Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.4113942Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^
2025-05-07T20:31:36.4114964Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.4884455Z self =
2025-05-07T20:31:37.4885903Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:37.4886621Z 
2025-05-07T20:31:37.4886829Z     @given(
2025-05-07T20:31:37.4887457Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:37.4888402Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:37.4895667Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:37.4896050Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:37.4896384Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:37.4896672Z     )
2025-05-07T20:31:37.4897038Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:37.4897487Z     def test_silu_mul_quant(
2025-05-07T20:31:37.4897729Z         self,
2025-05-07T20:31:37.4897930Z         T: int,
2025-05-07T20:31:37.4898130Z         D: int,
2025-05-07T20:31:37.4898354Z         scale_ub: Optional[float],
2025-05-07T20:31:37.4898634Z         contiguous: bool,
2025-05-07T20:31:37.4898876Z         compiled: bool,
2025-05-07T20:31:37.4899105Z     ) -> None:
2025-05-07T20:31:37.4899326Z         torch.manual_seed(2025)
2025-05-07T20:31:37.4899576Z 
2025-05-07T20:31:37.4899854Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:37.4900206Z 
2025-05-07T20:31:37.4900406Z         x_sign = torch.sign(x)
2025-05-07T20:31:37.4900703Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:37.4901009Z         x = x_sign * x_clamp
2025-05-07T20:31:37.4901253Z         x0 = x[:, :D]
2025-05-07T20:31:37.4901470Z         x1 = x[:, D:]
2025-05-07T20:31:37.4901675Z 
2025-05-07T20:31:37.4901866Z         if contiguous:
2025-05-07T20:31:37.4902103Z             x0 = x0.contiguous()
2025-05-07T20:31:37.4902360Z             x1 = x1.contiguous()
2025-05-07T20:31:37.4902605Z 
2025-05-07T20:31:37.4902803Z         if scale_ub is not None:
2025-05-07T20:31:37.4903076Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:37.4903418Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:37.4903980Z             )
2025-05-07T20:31:37.4904174Z         else:
2025-05-07T20:31:37.4904390Z             scale_ub_tensor = None
2025-05-07T20:31:37.4904648Z 
2025-05-07T20:31:37.4904872Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:37.4905186Z             op = silu_mul_quant
2025-05-07T20:31:37.4905439Z             if compiled:
2025-05-07T20:31:37.4905684Z                 op = torch.compile(op)
2025-05-07T20:31:37.4905980Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:37.4906251Z 
2025-05-07T20:31:37.4906444Z         y_fp8, y_scale = fn()
2025-05-07T20:31:37.4906724Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:37.4907015Z 
2025-05-07T20:31:37.4907248Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:37.4907579Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:37.4907868Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:37.4908184Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:37.4908535Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:37.4908851Z 
2025-05-07T20:31:37.4909051Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:37.4909247Z 
2025-05-07T20:31:37.4909353Z moe/activation_test.py:126: 
2025-05-07T20:31:37.4909646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:37.4910047Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:37.4910374Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:37.4911161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:37.4912077Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:37.4912625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:37.4913308Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:37.4913988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:37.4914831Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:37.4915576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:37.4916325Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:37.4917048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:37.4917690Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:37.4918286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:37.4918806Z     fn()
2025-05-07T20:31:37.4919307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:37.4919887Z     self.fn.run(
2025-05-07T20:31:37.4920353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:37.4920880Z     kernel = self.compile(
2025-05-07T20:31:37.4921417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:37.4922064Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:37.4922463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:37.4922691Z 
2025-05-07T20:31:37.4922916Z self =
2025-05-07T20:31:37.4924007Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:37.4925411Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab68cea820>}
2025-05-07T20:31:37.4926753Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:37.4927776Z context =
2025-05-07T20:31:37.4928061Z 
2025-05-07T20:31:37.4928237Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:37.4928765Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:37.4929226Z                           module_map=module_map)
2025-05-07T20:31:37.4929593Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:37.4929946Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:37.4930223Z E       ^
2025-05-07T20:31:37.4930693Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.4931147Z 
2025-05-07T20:31:37.4931559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
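(Editor's note) The eager reference path fails the same way because triton_quantize_fp8_row is itself a Triton kernel that materializes an fp8e4nv (e4m3) output. For readers following the numerics, here is a rough eager-mode sketch of the row-wise FP8 quantization contract the test relies on (y_fp8.float() * scale[:, None] recovers y); the function name, the eps, and the exact scale_ub semantics are this editor's assumptions, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max |value| lands at the
        # fp8 max (448.0 for float8_e4m3fn).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outliers before scaling
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

On an SM 8.9+ GPU the Triton kernel does this in a single pass; the sketch exists only to make the expected numerics concrete.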
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.0002705Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:38.0004091Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.0005291Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:38.0006321Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:38.0007345Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:38.0008867Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.0010465Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.0011574Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:38.0012614Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:38.0013916Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:38.0015274Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:38.0016434Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.0017339Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.0018077Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:38.0019100Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.1869104Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:38.1870463Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:38.1871801Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:38.1873209Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:38.1874584Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:38.1875962Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.1877260Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:38.1878624Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.1880054Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.1881298Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:38.1882519Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.1883728Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:38.1884764Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:38.1885949Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:38.1887172Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.1888558Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.1889670Z W0507 
20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:38.1890711Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:38.1891889Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:38.1893241Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:38.1894307Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.1895209Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.1895951Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:38.1896965Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.6913523Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:38.6914741Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:38.6916089Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:38.6917511Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:38.6918895Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:38.6920272Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.6921576Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:38.6922940Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.6924509Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.6925754Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:38.6926969Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.6928279Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:38.6929312Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:38.6930326Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:38.6931547Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.6932816Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.6933932Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:38.6934966Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:38.6936153Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:38.6937501Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:38.6938553Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.6939464Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.6940204Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:38.6941276Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.7312594Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:38.7313895Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:38.7315229Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:38.7316642Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:38.7318195Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:38.7319574Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.7320866Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:38.7322343Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.7323757Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.7325003Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:38.7326221Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.7327425Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:38.7328450Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:38.7329466Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:38.7330725Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.7332009Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.7333114Z W0507 
2025-05-07T20:31:39.2168240Z self =
2025-05-07T20:31:39.2169160Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:39.2169506Z 
2025-05-07T20:31:39.2188898Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:39.2189095Z 
2025-05-07T20:31:39.2189207Z moe/activation_test.py:126: 
2025-05-07T20:31:39.2189504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:39.2189905Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:39.2190237Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:39.2191030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:39.2191796Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:39.2210028Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:39.2210389Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:39.2210657Z E       ^
2025-05-07T20:31:39.2211118Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:39.2211583Z 
2025-05-07T20:31:39.2211998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:39.7441451Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:39.7442659Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:39.7443871Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:39.7444905Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:39.7445918Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:39.7447129Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:39.7448406Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:39.7449522Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:39.7450563Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:39.7451735Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:39.7453163Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:39.7454227Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.7455211Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.7455949Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:39.7456957Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:40.9399982Z self = 
2025-05-07T20:31:40.9400678Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:40.9401056Z 
2025-05-07T20:31:40.9401182Z     @given(
2025-05-07T20:31:40.9401723Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:40.9402170Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:40.9402503Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:40.9402828Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:40.9403184Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:40.9403601Z     )
2025-05-07T20:31:40.9404129Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:40.9404570Z     def test_silu_mul_quant(
2025-05-07T20:31:40.9404809Z         self,
2025-05-07T20:31:40.9405010Z         T: int,
2025-05-07T20:31:40.9405208Z         D: int,
2025-05-07T20:31:40.9405428Z         scale_ub: Optional[float],
2025-05-07T20:31:40.9405700Z         contiguous: bool,
2025-05-07T20:31:40.9405935Z         compiled: bool,
2025-05-07T20:31:40.9406163Z     ) -> None:
2025-05-07T20:31:40.9406382Z         torch.manual_seed(2025)
2025-05-07T20:31:40.9406619Z 
2025-05-07T20:31:40.9406899Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:40.9407239Z 
2025-05-07T20:31:40.9407425Z         x_sign = torch.sign(x)
2025-05-07T20:31:40.9407714Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:40.9408024Z         x = x_sign * x_clamp
2025-05-07T20:31:40.9408273Z         x0 = x[:, :D]
2025-05-07T20:31:40.9408479Z         x1 = x[:, D:]
2025-05-07T20:31:40.9408683Z 
2025-05-07T20:31:40.9408870Z         if contiguous:
2025-05-07T20:31:40.9409095Z             x0 = x0.contiguous()
2025-05-07T20:31:40.9409357Z             x1 = x1.contiguous()
2025-05-07T20:31:40.9409598Z 
2025-05-07T20:31:40.9409785Z         if scale_ub is not None:
2025-05-07T20:31:40.9410058Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:40.9410394Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:40.9410698Z             )
2025-05-07T20:31:40.9410895Z         else:
2025-05-07T20:31:40.9411111Z             scale_ub_tensor = None
2025-05-07T20:31:40.9411355Z 
2025-05-07T20:31:40.9411589Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:40.9411903Z             op = silu_mul_quant
2025-05-07T20:31:40.9412147Z             if compiled:
2025-05-07T20:31:40.9412393Z                 op = torch.compile(op)
2025-05-07T20:31:40.9412696Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:40.9412964Z 
2025-05-07T20:31:40.9413153Z         y_fp8, y_scale = fn()
2025-05-07T20:31:40.9413437Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:40.9413730Z 
2025-05-07T20:31:40.9413956Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:40.9414289Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:40.9414582Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:40.9414892Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:40.9415248Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:40.9415560Z 
2025-05-07T20:31:40.9415753Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:40.9415953Z 
2025-05-07T20:31:40.9416054Z moe/activation_test.py:126: 
2025-05-07T20:31:40.9416350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:40.9416687Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:40.9417007Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:40.9417796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:40.9418559Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:40.9419099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:40.9419784Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:40.9420591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:40.9421369Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:40.9422110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:40.9422956Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:40.9423678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:40.9424315Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:40.9424910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:40.9425427Z     fn()
2025-05-07T20:31:40.9425931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:40.9426502Z     self.fn.run(
2025-05-07T20:31:40.9426965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:40.9427490Z     kernel = self.compile(
2025-05-07T20:31:40.9428029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:40.9428682Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:40.9429077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:40.9429307Z 
2025-05-07T20:31:40.9429523Z self = 
2025-05-07T20:31:40.9430680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:40.9432114Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa04853c10>}
2025-05-07T20:31:40.9433463Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:40.9434495Z context = 
2025-05-07T20:31:40.9434782Z 
2025-05-07T20:31:40.9434956Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:40.9435469Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:40.9435931Z                            module_map=module_map)
2025-05-07T20:31:40.9436296Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:40.9436653Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:40.9436920Z E       ^
2025-05-07T20:31:40.9437392Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:40.9437841Z 
2025-05-07T20:31:40.9438262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
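Both failing stacks bottom out in triton_quantize_fp8_row, the row-wise fp8 quantizer that ref_fn calls. The test dequantizes with y = y_fp8.to(torch.float32) * y_scale[:, None], so the op returns one dequantization scale per row. A rough eager-mode sketch of that contract follows, for orientation only: the exact scale formula, the clamping, and the FP8_MAX constant (448.0, the largest finite float8_e4m3fn value) are assumptions here, not FBGEMM's kernel:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = 448.0  # largest finite float8_e4m3fn value

    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude, optionally capped by scale_ub as in the test.
        row_max = x.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_MAX  # dequant scale per row
        x_q = (x.float() / scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
        return x_q.to(torch.float8_e4m3fn), scale

The eager cast at the end goes through PyTorch's own float8 conversion path, which is why only the Triton kernels fail on this machine.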
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:41.4766346Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:41.4767557Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:41.4768766Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:41.4769804Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:41.4770824Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:41.4772094Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:41.4773376Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:41.4774490Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:41.4775532Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:41.4776706Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:41.4778132Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:41.4779199Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.4780119Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.4780940Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:41.4781959Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.6648457Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:41.6650945Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:41.6652296Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:41.6653719Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:41.6655093Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:41.6656466Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.6657766Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:41.6659146Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.6660568Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:41.6661861Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:41.6663082Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:41.6664284Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:41.6665330Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:41.6666351Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:41.6667579Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:41.6669075Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:41.6670264Z W0507 
20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:41.6671435Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:41.6672655Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:41.6674012Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:41.6675075Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.6675987Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.6676745Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:41.6677762Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.1712163Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:42.1713557Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:42.1714901Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:42.1716327Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:42.1717697Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:42.1719065Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.1720369Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:42.1721736Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.1723149Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:42.1724386Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:42.1725759Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:42.1726959Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:42.1727985Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:42.1729104Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:42.1730310Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:42.1731580Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:42.1732680Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:42.1733715Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:42.1734882Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:42.1736221Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:42.1737272Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.1738171Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.1738906Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:42.1739912Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.2099936Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:42.2101215Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:42.2102595Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:42.2104184Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:42.2105560Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:42.2106926Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.2108403Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:42.2109765Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.2111354Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:42.2112584Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:42.2113795Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:42.2114993Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:42.2116015Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:42.2117033Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:42.2118239Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:42.2119510Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:42.2120612Z W0507 
2025-05-07T20:31:42.6598461Z self = 
2025-05-07T20:31:42.6599905Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:42.6615522Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:42.6615717Z 
2025-05-07T20:31:42.6615823Z moe/activation_test.py:126: 
2025-05-07T20:31:42.6635917Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:42.6636274Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:42.6636538Z E       ^
2025-05-07T20:31:42.6637007Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:42.6637462Z 
2025-05-07T20:31:42.6637875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:42.6638384Z 
2025-05-07T20:31:42.6638493Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:42.6638907Z     self=,
2025-05-07T20:31:42.6639306Z     T=16384,
2025-05-07T20:31:42.6639500Z     D=5120,
2025-05-07T20:31:42.6639690Z     scale_ub=None,
2025-05-07T20:31:42.6639899Z     contiguous=True,
2025-05-07T20:31:42.6640122Z     compiled=True,
2025-05-07T20:31:42.6640325Z )
2025-05-07T20:31:42.7083732Z W0507 20:31:42.706620 86817 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:42.7086532Z W0507 20:31:42.706620 86817 site-packages/torch/_dynamo/convert_frame.py:987] [1/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:42.7089191Z W0507 20:31:42.706620 86817 site-packages/torch/_dynamo/convert_frame.py:987] [1/8]    last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:42.7091176Z W0507 20:31:42.706620 86817 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:42.7092473Z W0507 20:31:42.706620 86817 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
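This recompile warning is a separate issue from the fp8 errors: every Hypothesis draw changes T, and the contiguous=False draws also change the strides of x0/x1 (the "stride mismatch at index 0" above), so torch.compile re-traces silu_mul_quant until it hits the default recompile_limit of 8 and stops optimizing. A sketch of one mitigation, using the real torch._dynamo.mark_dynamic API with the tensor names from the test; marking sizes dynamic stops size-driven retracing, though stride changes from non-contiguous slices may still re-trace:

    import torch

    op = torch.compile(silu_mul_quant)  # silu_mul_quant as imported by the test

    # Mark dim 0 (the token count T) as dynamic before the first call so a
    # single graph covers all of Hypothesis's T draws.
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)
    y_fp8, y_scale = op(x0, x1, scale_ub_tensor)

    # Coarser alternative: torch.compile(silu_mul_quant, dynamic=True)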
2025-05-07T20:31:42.8286155Z self = 
2025-05-07T20:31:42.8286897Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:42.8309641Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:42.8310014Z moe/activation_test.py:126: 
2025-05-07T20:31:42.8330419Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:42.8330781Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:42.8331050Z E       ^
2025-05-07T20:31:42.8331616Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:42.8332074Z 
2025-05-07T20:31:42.8332501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:42.8333092Z 
2025-05-07T20:31:42.8333208Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:42.8333626Z     self=,
2025-05-07T20:31:42.8334031Z     T=1,
2025-05-07T20:31:42.8334210Z     D=5120,
2025-05-07T20:31:42.8334405Z     scale_ub=1200.0,
2025-05-07T20:31:42.8334626Z     contiguous=True,
2025-05-07T20:31:42.8334852Z     compiled=True,
2025-05-07T20:31:42.8335060Z )
2025-05-07T20:31:43.2025096Z self = 
2025-05-07T20:31:43.2026308Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:43.2040409Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:43.2040569Z 
2025-05-07T20:31:43.2040679Z moe/activation_test.py:117: 
2025-05-07T20:31:43.2040967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:43.2041293Z moe/activation_test.py:115: in fn
2025-05-07T20:31:43.2041568Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:43.2042124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:43.2042851Z     return fn(*args, **kwargs)
2025-05-07T20:31:43.2043515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.2044195Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.2044718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.2045560Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.2046216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.2046747Z kernel = self.compile( 2025-05-07T20:31:43.2047278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.2047925Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.2048323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2048549Z 2025-05-07T20:31:43.2048754Z self = 2025-05-07T20:31:43.2049832Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.2051267Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa049af310>} 2025-05-07T20:31:43.2052659Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.2053680Z context = 2025-05-07T20:31:43.2053971Z 2025-05-07T20:31:43.2054139Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.2054653Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.2055116Z module_map=module_map) 2025-05-07T20:31:43.2055481Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.2055825Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.2056078Z E ^ 2025-05-07T20:31:43.2056547Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
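Every Hypothesis example in this test fails at the same point: src.make_ir rejects the fp8e4nv dtype while compiling either _fbgemm_silu_mul_quant (reached from silu_mul_quant at gen_ai/moe/activation.py:80) or _kernel_quantize_fp8_row (reached from triton_quantize_fp8_row in the reference path). fp8e4nv is Triton's name for torch.float8_e4m3fn, and the ValueError says this GPU only gets fp8e4b15 and fp8e5. A minimal sketch of a capability gate that would skip these cases on such hardware; the SM 8.9 cutoff, the helper name, and the class name are assumptions (the log elides the test class repr), not anything taken from FBGEMM:

import unittest

import torch

def _supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv (torch.float8_e4m3fn) only on NVIDIA
    # GPUs with compute capability >= 8.9 (Ada/Hopper). Pre-Ada parts such as
    # the A100 (8.0) or A10G (8.6) would hit the ValueError seen above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical placement; the real test class name is elided in this log.
@unittest.skipIf(not _supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
class ActivationTests(unittest.TestCase):
    ...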
Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
> y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E triton.compiler.errors.CompilationError (_kernel_quantize_fp8_row): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
> y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E triton.compiler.errors.CompilationError (_kernel_quantize_fp8_row): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.5763796Z 2025-05-07T20:31:44.5764209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.5764721Z 2025-05-07T20:31:44.5764826Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5765234Z self=, 2025-05-07T20:31:44.5765638Z T=1, 2025-05-07T20:31:44.5765820Z D=5120, 2025-05-07T20:31:44.5766002Z scale_ub=1200.0, 2025-05-07T20:31:44.5766226Z contiguous=False, 2025-05-07T20:31:44.5766454Z compiled=False, 2025-05-07T20:31:44.5766654Z ) 2025-05-07T20:31:44.5766965Z self = 2025-05-07T20:31:44.5767451Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:44.5767713Z 2025-05-07T20:31:44.5767795Z @given( 2025-05-07T20:31:44.5768014Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5768328Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5768635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5768953Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5769282Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5769563Z ) 2025-05-07T20:31:44.5769902Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5770335Z def test_silu_mul_quant( 2025-05-07T20:31:44.5770570Z self, 2025-05-07T20:31:44.5770753Z T: int, 2025-05-07T20:31:44.5770947Z D: int, 2025-05-07T20:31:44.5771171Z scale_ub: Optional[float], 2025-05-07T20:31:44.5771439Z contiguous: bool, 2025-05-07T20:31:44.5771748Z compiled: bool, 2025-05-07T20:31:44.5771969Z ) -> None: 2025-05-07T20:31:44.5772185Z torch.manual_seed(2025) 2025-05-07T20:31:44.5772457Z 2025-05-07T20:31:44.5772728Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5773065Z 2025-05-07T20:31:44.5773321Z x_sign = torch.sign(x) 2025-05-07T20:31:44.5773607Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.5773912Z x = x_sign * x_clamp 2025-05-07T20:31:44.5774143Z x0 = x[:, :D] 2025-05-07T20:31:44.5774359Z x1 = x[:, D:] 2025-05-07T20:31:44.5774563Z 2025-05-07T20:31:44.5774736Z if contiguous: 2025-05-07T20:31:44.5774961Z x0 = x0.contiguous() 2025-05-07T20:31:44.5775215Z x1 = x1.contiguous() 2025-05-07T20:31:44.5775445Z 2025-05-07T20:31:44.5775633Z if scale_ub is not None: 2025-05-07T20:31:44.5775903Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.5776232Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.5776539Z ) 2025-05-07T20:31:44.5776725Z else: 2025-05-07T20:31:44.5776930Z scale_ub_tensor = None 2025-05-07T20:31:44.5777165Z 2025-05-07T20:31:44.5777390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.5777708Z op = silu_mul_quant 2025-05-07T20:31:44.5777945Z if compiled: 2025-05-07T20:31:44.5778186Z op = torch.compile(op) 2025-05-07T20:31:44.5778479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5778746Z 2025-05-07T20:31:44.5778931Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.5779094Z 2025-05-07T20:31:44.5779198Z moe/activation_test.py:117: 2025-05-07T20:31:44.5779481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5779804Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.5780086Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5780776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.5781450Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.5781980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.5782662Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.5783309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.5783838Z kernel = self.compile( 2025-05-07T20:31:44.5784368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.5785011Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.5785395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5785625Z 2025-05-07T20:31:44.5785829Z self = 2025-05-07T20:31:44.5786901Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.5788278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02f22820>} 2025-05-07T20:31:44.5789619Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.5790703Z context = 2025-05-07T20:31:44.5790994Z 2025-05-07T20:31:44.5791237Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.5791764Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.5792268Z module_map=module_map) 2025-05-07T20:31:44.5792634Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.5793085Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.5793341Z E ^ 2025-05-07T20:31:44.5793797Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.5794249Z 2025-05-07T20:31:44.5794662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.5795169Z 2025-05-07T20:31:44.5795279Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5795689Z self=, 2025-05-07T20:31:44.5796089Z T=16384, 2025-05-07T20:31:44.5796284Z D=5120, 2025-05-07T20:31:44.5796472Z scale_ub=1200.0, 2025-05-07T20:31:44.5796688Z contiguous=False, 2025-05-07T20:31:44.5796909Z compiled=True, 2025-05-07T20:31:44.5797111Z ) 2025-05-07T20:31:44.6993490Z self = 2025-05-07T20:31:44.6994281Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:44.6994667Z 2025-05-07T20:31:44.6994785Z @given( 2025-05-07T20:31:44.6995054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.6995373Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.6995686Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.6996014Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.6996352Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.6996640Z ) 2025-05-07T20:31:44.6996993Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.6997437Z def test_silu_mul_quant( 2025-05-07T20:31:44.6997685Z self, 2025-05-07T20:31:44.6997875Z T: int, 2025-05-07T20:31:44.6998081Z D: int, 2025-05-07T20:31:44.6998301Z scale_ub: Optional[float], 2025-05-07T20:31:44.6998577Z contiguous: bool, 2025-05-07T20:31:44.6998819Z compiled: bool, 2025-05-07T20:31:44.6999045Z ) -> None: 2025-05-07T20:31:44.6999267Z torch.manual_seed(2025) 2025-05-07T20:31:44.6999511Z 2025-05-07T20:31:44.6999784Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.7000134Z 2025-05-07T20:31:44.7000327Z x_sign = torch.sign(x) 2025-05-07T20:31:44.7000626Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.7000938Z x = x_sign * x_clamp 2025-05-07T20:31:44.7001168Z x0 = x[:, :D] 2025-05-07T20:31:44.7001384Z x1 = x[:, D:] 2025-05-07T20:31:44.7001592Z 2025-05-07T20:31:44.7001783Z if contiguous: 2025-05-07T20:31:44.7002052Z x0 = x0.contiguous() 2025-05-07T20:31:44.7002334Z x1 = x1.contiguous() 2025-05-07T20:31:44.7002567Z 2025-05-07T20:31:44.7002765Z if scale_ub is not None: 2025-05-07T20:31:44.7003039Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.7003376Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.7003869Z ) 2025-05-07T20:31:44.7004068Z else: 2025-05-07T20:31:44.7004281Z scale_ub_tensor = None 2025-05-07T20:31:44.7004530Z 2025-05-07T20:31:44.7004762Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.7005076Z op = silu_mul_quant 2025-05-07T20:31:44.7005319Z if compiled: 2025-05-07T20:31:44.7005566Z op = torch.compile(op) 2025-05-07T20:31:44.7005862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.7006131Z 2025-05-07T20:31:44.7006320Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.7006656Z 2025-05-07T20:31:44.7006770Z moe/activation_test.py:117: 2025-05-07T20:31:44.7007063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.7007393Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.7007676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.7008359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.7008909Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.7009569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.7010254Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.7010784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.7011459Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.7012154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.7012718Z kernel = self.compile( 2025-05-07T20:31:44.7013250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.7013912Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.7014306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.7014533Z 2025-05-07T20:31:44.7014745Z self = 2025-05-07T20:31:44.7015823Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.7017206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa0295d790>} 2025-05-07T20:31:44.7018547Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.7019590Z context = 2025-05-07T20:31:44.7019877Z 2025-05-07T20:31:44.7020048Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.7020571Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.7021035Z module_map=module_map) 2025-05-07T20:31:44.7021400Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.7021745Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.7022002Z E ^ 2025-05-07T20:31:44.7022475Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.7022926Z 2025-05-07T20:31:44.7023346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.7023856Z 2025-05-07T20:31:44.7023964Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.7024376Z self=, 2025-05-07T20:31:44.7024776Z T=2048, 2025-05-07T20:31:44.7024957Z D=7168, 2025-05-07T20:31:44.7025148Z scale_ub=1200.0, 2025-05-07T20:31:44.7025372Z contiguous=False, 2025-05-07T20:31:44.7025597Z compiled=True, 2025-05-07T20:31:44.7025794Z ) 2025-05-07T20:31:44.7026109Z self = 2025-05-07T20:31:44.7026602Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:44.7026874Z 2025-05-07T20:31:44.7027481Z @given( 2025-05-07T20:31:44.7027713Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.7028029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.7028330Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.7028655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.7029062Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.7029344Z ) 2025-05-07T20:31:44.7029691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.7030198Z def test_silu_mul_quant( 2025-05-07T20:31:44.7030438Z self, 2025-05-07T20:31:44.7030625Z T: int, 2025-05-07T20:31:44.7030822Z D: int, 2025-05-07T20:31:44.7031038Z scale_ub: Optional[float], 2025-05-07T20:31:44.7031299Z contiguous: bool, 2025-05-07T20:31:44.7031539Z compiled: bool, 2025-05-07T20:31:44.7031764Z ) -> None: 2025-05-07T20:31:44.7031995Z torch.manual_seed(2025) 2025-05-07T20:31:44.7032268Z 2025-05-07T20:31:44.7032546Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.7032882Z 2025-05-07T20:31:44.7033076Z x_sign = torch.sign(x) 2025-05-07T20:31:44.7033367Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.7033675Z x = x_sign * x_clamp 2025-05-07T20:31:44.7033914Z x0 = x[:, :D] 2025-05-07T20:31:44.7034129Z x1 = x[:, D:] 2025-05-07T20:31:44.7034329Z 2025-05-07T20:31:44.7034515Z if contiguous: 2025-05-07T20:31:44.7034743Z x0 = x0.contiguous() 2025-05-07T20:31:44.7035006Z x1 = x1.contiguous() 2025-05-07T20:31:44.7035239Z 2025-05-07T20:31:44.7035428Z if scale_ub is not None: 2025-05-07T20:31:44.7035697Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.7036024Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.7036335Z ) 2025-05-07T20:31:44.7036531Z else: 2025-05-07T20:31:44.7036739Z scale_ub_tensor = None 2025-05-07T20:31:44.7036986Z 2025-05-07T20:31:44.7037220Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.7037528Z op = silu_mul_quant 2025-05-07T20:31:44.7037778Z if compiled: 2025-05-07T20:31:44.7038033Z op = torch.compile(op) 2025-05-07T20:31:44.7038328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.7038600Z 2025-05-07T20:31:44.7038792Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.7038953Z 2025-05-07T20:31:44.7039053Z moe/activation_test.py:117: 2025-05-07T20:31:44.7039346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.7039676Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.7039958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.7040506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.7041067Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.7041728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.7042461Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.7042994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.7043677Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.7044332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.7044852Z kernel = self.compile( 2025-05-07T20:31:44.7045387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.7046038Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.7046512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.7046740Z 2025-05-07T20:31:44.7046946Z self = 2025-05-07T20:31:44.7048029Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.7049481Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02a404c0>} 2025-05-07T20:31:44.7050828Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.7051880Z context = 2025-05-07T20:31:44.7052202Z 2025-05-07T20:31:44.7052369Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.7052898Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.7053363Z module_map=module_map) 2025-05-07T20:31:44.7053728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.7054080Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.7054344Z E ^ 2025-05-07T20:31:44.7054810Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.7055265Z 2025-05-07T20:31:44.7055683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.7056195Z 2025-05-07T20:31:44.9728251Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9728803Z self=, 2025-05-07T20:31:44.9729401Z T=1, 2025-05-07T20:31:44.9729658Z D=5120, 2025-05-07T20:31:44.9729920Z scale_ub=None, 2025-05-07T20:31:44.9730212Z contiguous=False, 2025-05-07T20:31:44.9730474Z compiled=False, 2025-05-07T20:31:44.9730685Z ) 2025-05-07T20:31:44.9731035Z self = 2025-05-07T20:31:44.9731530Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:44.9731803Z 2025-05-07T20:31:44.9731882Z @given( 2025-05-07T20:31:44.9732135Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9732458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9732762Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9733096Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.9733429Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.9733713Z ) 2025-05-07T20:31:44.9734062Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.9734503Z def test_silu_mul_quant( 2025-05-07T20:31:44.9734740Z self, 2025-05-07T20:31:44.9734937Z T: int, 2025-05-07T20:31:44.9735137Z D: int, 2025-05-07T20:31:44.9735351Z scale_ub: Optional[float], 2025-05-07T20:31:44.9735625Z contiguous: bool, 2025-05-07T20:31:44.9735865Z compiled: bool, 2025-05-07T20:31:44.9736091Z ) -> None: 2025-05-07T20:31:44.9736303Z torch.manual_seed(2025) 2025-05-07T20:31:44.9736550Z 2025-05-07T20:31:44.9736820Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.9737157Z 2025-05-07T20:31:44.9737353Z x_sign = torch.sign(x) 2025-05-07T20:31:44.9737647Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.9737953Z x = x_sign * x_clamp 2025-05-07T20:31:44.9738194Z x0 = x[:, :D] 2025-05-07T20:31:44.9738413Z x1 = x[:, D:] 2025-05-07T20:31:44.9738810Z 2025-05-07T20:31:44.9739010Z if contiguous: 2025-05-07T20:31:44.9739244Z x0 = x0.contiguous() 2025-05-07T20:31:44.9739501Z x1 = x1.contiguous() 2025-05-07T20:31:44.9739741Z 2025-05-07T20:31:44.9739935Z if scale_ub is not None: 2025-05-07T20:31:44.9740205Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.9740657Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.9740967Z ) 2025-05-07T20:31:44.9741161Z else: 2025-05-07T20:31:44.9741369Z scale_ub_tensor = None 2025-05-07T20:31:44.9741617Z 2025-05-07T20:31:44.9741848Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.9742158Z op = silu_mul_quant 2025-05-07T20:31:44.9742408Z if compiled: 2025-05-07T20:31:44.9742655Z op = torch.compile(op) 2025-05-07T20:31:44.9742948Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9743222Z 2025-05-07T20:31:44.9743422Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.9743585Z 2025-05-07T20:31:44.9743685Z moe/activation_test.py:117: 2025-05-07T20:31:44.9743978Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9744308Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.9744591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9745278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.9745965Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.9746505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.9747176Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.9747840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.9748373Z kernel = self.compile( 2025-05-07T20:31:44.9748911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.9749557Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.9750048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9750279Z 2025-05-07T20:31:44.9750492Z self = 2025-05-07T20:31:44.9751575Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.9753005Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02a40820>} 2025-05-07T20:31:44.9754357Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.9755379Z context = 2025-05-07T20:31:44.9755668Z 2025-05-07T20:31:44.9755837Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.9756357Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.9756824Z module_map=module_map) 2025-05-07T20:31:44.9757188Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.9757540Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.9757798Z E ^ 2025-05-07T20:31:44.9758266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.9758716Z 2025-05-07T20:31:44.9759221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.9759730Z 2025-05-07T20:31:44.9759832Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9760245Z self=, 2025-05-07T20:31:44.9760731Z T=4096, 2025-05-07T20:31:44.9760914Z D=7168, 2025-05-07T20:31:44.9761105Z scale_ub=1200.0, 2025-05-07T20:31:44.9761333Z contiguous=False, 2025-05-07T20:31:44.9761554Z compiled=False, 2025-05-07T20:31:44.9761757Z ) 2025-05-07T20:31:44.9762077Z self = 2025-05-07T20:31:44.9762619Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:44.9762897Z 2025-05-07T20:31:44.9762979Z @given( 2025-05-07T20:31:44.9763211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9763527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9763832Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9764164Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.9764490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.9764771Z ) 2025-05-07T20:31:44.9765119Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.9765566Z def test_silu_mul_quant( 2025-05-07T20:31:44.9765804Z self, 2025-05-07T20:31:44.9765995Z T: int, 2025-05-07T20:31:44.9766194Z D: int, 2025-05-07T20:31:44.9766403Z scale_ub: Optional[float], 2025-05-07T20:31:44.9766672Z contiguous: bool, 2025-05-07T20:31:44.9766913Z compiled: bool, 2025-05-07T20:31:44.9767133Z ) -> None: 2025-05-07T20:31:44.9767350Z torch.manual_seed(2025) 2025-05-07T20:31:44.9767590Z 2025-05-07T20:31:44.9767859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.9768199Z 2025-05-07T20:31:44.9768393Z x_sign = torch.sign(x) 2025-05-07T20:31:44.9768685Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.9768987Z x = x_sign * x_clamp 2025-05-07T20:31:44.9769227Z x0 = x[:, :D] 2025-05-07T20:31:44.9769441Z x1 = x[:, D:] 2025-05-07T20:31:44.9769647Z 2025-05-07T20:31:44.9769831Z if contiguous: 2025-05-07T20:31:44.9770062Z x0 = x0.contiguous() 2025-05-07T20:31:44.9770315Z x1 = x1.contiguous() 2025-05-07T20:31:44.9770555Z 2025-05-07T20:31:44.9770746Z if scale_ub is not None: 2025-05-07T20:31:44.9771018Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.9771352Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.9771658Z ) 2025-05-07T20:31:44.9771847Z else: 2025-05-07T20:31:44.9772057Z scale_ub_tensor = None 2025-05-07T20:31:44.9772308Z 2025-05-07T20:31:44.9772545Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.9772852Z op = silu_mul_quant 2025-05-07T20:31:44.9773103Z if compiled: 2025-05-07T20:31:44.9773351Z op = torch.compile(op) 2025-05-07T20:31:44.9773644Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9773923Z 2025-05-07T20:31:44.9774116Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.9774280Z 2025-05-07T20:31:44.9774381Z moe/activation_test.py:117: 2025-05-07T20:31:44.9774674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9775004Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.9775282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9775968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.9776653Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.9777265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.9777943Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.9778596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.9779198Z kernel = self.compile( 2025-05-07T20:31:44.9779728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.9780379Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.9780773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9780999Z 2025-05-07T20:31:44.9781210Z self = 2025-05-07T20:31:44.9782349Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.9783726Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02f8faf0>} 2025-05-07T20:31:44.9785072Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.9786093Z context = 2025-05-07T20:31:44.9786379Z 2025-05-07T20:31:44.9786549Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.9787065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.9787529Z module_map=module_map) 2025-05-07T20:31:44.9787896Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.9788246Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.9788506Z E ^ 2025-05-07T20:31:44.9788972Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.9789424Z 2025-05-07T20:31:44.9789920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.9790429Z 2025-05-07T20:31:44.9790530Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9790944Z self=, 2025-05-07T20:31:44.9791353Z T=16384, 2025-05-07T20:31:44.9791541Z D=7168, 2025-05-07T20:31:44.9791736Z scale_ub=None, 2025-05-07T20:31:44.9791951Z contiguous=True, 2025-05-07T20:31:44.9792175Z compiled=True, 2025-05-07T20:31:44.9792399Z ) 2025-05-07T20:31:45.0966379Z self = 2025-05-07T20:31:45.0967125Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.0967518Z 2025-05-07T20:31:45.0967628Z @given( 2025-05-07T20:31:45.0967930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.0968364Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.0968685Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.0969012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.0969347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.0969636Z ) 2025-05-07T20:31:45.0969988Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.0970426Z def test_silu_mul_quant( 2025-05-07T20:31:45.0970672Z self, 2025-05-07T20:31:45.0970868Z T: int, 2025-05-07T20:31:45.0971061Z D: int, 2025-05-07T20:31:45.0971278Z scale_ub: Optional[float], 2025-05-07T20:31:45.0971722Z contiguous: bool, 2025-05-07T20:31:45.0971965Z compiled: bool, 2025-05-07T20:31:45.0972192Z ) -> None: 2025-05-07T20:31:45.0972410Z torch.manual_seed(2025) 2025-05-07T20:31:45.0972647Z 2025-05-07T20:31:45.0972920Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.0973376Z 2025-05-07T20:31:45.0973573Z x_sign = torch.sign(x) 2025-05-07T20:31:45.0973864Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.0974180Z x = x_sign * x_clamp 2025-05-07T20:31:45.0974418Z x0 = x[:, :D] 2025-05-07T20:31:45.0974636Z x1 = x[:, D:] 2025-05-07T20:31:45.0974844Z 2025-05-07T20:31:45.0975025Z if contiguous: 2025-05-07T20:31:45.0975256Z x0 = x0.contiguous() 2025-05-07T20:31:45.0975516Z x1 = x1.contiguous() 2025-05-07T20:31:45.0975761Z 2025-05-07T20:31:45.0975950Z if scale_ub is not None: 2025-05-07T20:31:45.0976237Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.0976577Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.0976882Z ) 2025-05-07T20:31:45.0977076Z else: 2025-05-07T20:31:45.0977290Z scale_ub_tensor = None 2025-05-07T20:31:45.0977541Z 2025-05-07T20:31:45.0977771Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.0978088Z op = silu_mul_quant 2025-05-07T20:31:45.0984811Z if compiled: 2025-05-07T20:31:45.0985088Z op = torch.compile(op) 2025-05-07T20:31:45.0985409Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.0985698Z 2025-05-07T20:31:45.0985904Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.0986079Z 2025-05-07T20:31:45.0986186Z moe/activation_test.py:117: 2025-05-07T20:31:45.0986487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.0986836Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.0987138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.0987711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.0988292Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.0988971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.0989683Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.0990291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.0990994Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.0991672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.0992212Z kernel = self.compile( 2025-05-07T20:31:45.0992808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.0993470Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.0993872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.0994104Z 2025-05-07T20:31:45.0994314Z self = 2025-05-07T20:31:45.0995410Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.0996812Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02e67790>} 2025-05-07T20:31:45.0998277Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.0999311Z context = 2025-05-07T20:31:45.0999598Z 2025-05-07T20:31:45.0999766Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.1000288Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.1000833Z module_map=module_map) 2025-05-07T20:31:45.1001193Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.1001551Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.1001811Z E ^ 2025-05-07T20:31:45.1002281Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.1002735Z 2025-05-07T20:31:45.1003157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.1003678Z 2025-05-07T20:31:45.1003951Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.1004369Z self=, 2025-05-07T20:31:45.1004772Z T=4096, 2025-05-07T20:31:45.1004954Z D=5120, 2025-05-07T20:31:45.1005142Z scale_ub=None, 2025-05-07T20:31:45.1005370Z contiguous=False, 2025-05-07T20:31:45.1005596Z compiled=True, 2025-05-07T20:31:45.1005799Z ) 2025-05-07T20:31:45.1006115Z self = 2025-05-07T20:31:45.1006603Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.1006879Z 2025-05-07T20:31:45.1006955Z @given( 2025-05-07T20:31:45.1007181Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.1007491Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.1007798Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.1008135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.1008473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.1008756Z ) 2025-05-07T20:31:45.1009107Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.1009553Z def test_silu_mul_quant( 2025-05-07T20:31:45.1009794Z self, 2025-05-07T20:31:45.1009987Z T: int, 2025-05-07T20:31:45.1010179Z D: int, 2025-05-07T20:31:45.1010391Z scale_ub: Optional[float], 2025-05-07T20:31:45.1010661Z contiguous: bool, 2025-05-07T20:31:45.1010903Z compiled: bool, 2025-05-07T20:31:45.1011152Z ) -> None: 2025-05-07T20:31:45.1011364Z torch.manual_seed(2025) 2025-05-07T20:31:45.1011607Z 2025-05-07T20:31:45.1011886Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.1012265Z 2025-05-07T20:31:45.1012465Z x_sign = torch.sign(x) 2025-05-07T20:31:45.1012763Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.1013070Z x = x_sign * x_clamp 2025-05-07T20:31:45.1013311Z x0 = x[:, :D] 2025-05-07T20:31:45.1013527Z x1 = x[:, D:] 2025-05-07T20:31:45.1013739Z 2025-05-07T20:31:45.1013918Z if contiguous: 2025-05-07T20:31:45.1014154Z x0 = x0.contiguous() 2025-05-07T20:31:45.1014421Z x1 = x1.contiguous() 2025-05-07T20:31:45.1014657Z 2025-05-07T20:31:45.1014851Z if scale_ub is not None: 2025-05-07T20:31:45.1015123Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.1015454Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.1015764Z ) 2025-05-07T20:31:45.1015963Z else: 2025-05-07T20:31:45.1016169Z scale_ub_tensor = None 2025-05-07T20:31:45.1016417Z 2025-05-07T20:31:45.1016649Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.1016957Z op = silu_mul_quant 2025-05-07T20:31:45.1017342Z if compiled: 2025-05-07T20:31:45.1017594Z op = torch.compile(op) 2025-05-07T20:31:45.1017886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.1018160Z 2025-05-07T20:31:45.1018354Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.1018520Z 2025-05-07T20:31:45.1018622Z moe/activation_test.py:117: 2025-05-07T20:31:45.1019024Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.1019356Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.1019639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.1020197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.1020758Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.1021422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.1022124Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.1022716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.1023402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.1024063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.1024604Z kernel = self.compile( 2025-05-07T20:31:45.1025144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.1025798Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.1026196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.1026426Z 2025-05-07T20:31:45.1026634Z self = 2025-05-07T20:31:45.1027733Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.1029124Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02b5d550>} 2025-05-07T20:31:45.1030537Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.1031570Z context = 2025-05-07T20:31:45.1031858Z 2025-05-07T20:31:45.1032027Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.1032550Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.1033023Z module_map=module_map) 2025-05-07T20:31:45.1033384Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.1033739Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.1033993Z E ^ 2025-05-07T20:31:45.1034455Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.1034918Z 2025-05-07T20:31:45.1035336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.1035855Z 2025-05-07T20:31:45.4949534Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4950909Z self=, 2025-05-07T20:31:45.4952000Z T=4096, 2025-05-07T20:31:45.4952397Z D=5120, 2025-05-07T20:31:45.4952626Z scale_ub=1200.0, 2025-05-07T20:31:45.4952854Z contiguous=False, 2025-05-07T20:31:45.4953079Z compiled=False, 2025-05-07T20:31:45.4953283Z ) 2025-05-07T20:31:45.4953781Z self = 2025-05-07T20:31:45.4954292Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.4954573Z 2025-05-07T20:31:45.4954651Z @given( 2025-05-07T20:31:45.4954885Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4955316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4955628Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4955958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4956292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4956582Z ) 2025-05-07T20:31:45.4956930Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4957379Z def test_silu_mul_quant( 2025-05-07T20:31:45.4957623Z self, 2025-05-07T20:31:45.4957817Z T: int, 2025-05-07T20:31:45.4958017Z D: int, 2025-05-07T20:31:45.4958250Z scale_ub: Optional[float], 2025-05-07T20:31:45.4958523Z contiguous: bool, 2025-05-07T20:31:45.4958762Z compiled: bool, 2025-05-07T20:31:45.4958993Z ) -> None: 2025-05-07T20:31:45.4959206Z torch.manual_seed(2025) 2025-05-07T20:31:45.4959454Z 2025-05-07T20:31:45.4959728Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4960083Z 2025-05-07T20:31:45.4960274Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4960568Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4960879Z x = x_sign * x_clamp 2025-05-07T20:31:45.4961115Z x0 = x[:, :D] 2025-05-07T20:31:45.4961330Z x1 = x[:, D:] 2025-05-07T20:31:45.4961544Z 2025-05-07T20:31:45.4961729Z if contiguous: 2025-05-07T20:31:45.4961964Z x0 = x0.contiguous() 2025-05-07T20:31:45.4962223Z x1 = x1.contiguous() 2025-05-07T20:31:45.4962460Z 2025-05-07T20:31:45.4962652Z if scale_ub is not None: 2025-05-07T20:31:45.4962933Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4963266Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4963575Z ) 2025-05-07T20:31:45.4963774Z else: 2025-05-07T20:31:45.4963982Z scale_ub_tensor = None 2025-05-07T20:31:45.4964243Z 2025-05-07T20:31:45.4964474Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4964794Z op = silu_mul_quant 2025-05-07T20:31:45.4965042Z if compiled: 2025-05-07T20:31:45.4965300Z op = torch.compile(op) 2025-05-07T20:31:45.4965602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4965873Z 2025-05-07T20:31:45.4966067Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4966234Z 2025-05-07T20:31:45.4966343Z moe/activation_test.py:117: 2025-05-07T20:31:45.4966636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4966973Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4967252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4967941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4968641Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4969195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4969881Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4970537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4971079Z kernel = self.compile( 2025-05-07T20:31:45.4971627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4972289Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4972787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4973021Z 2025-05-07T20:31:45.4973230Z self = 2025-05-07T20:31:45.4974336Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4975839Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02ebd0d0>} 2025-05-07T20:31:45.4977206Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4978258Z context = 2025-05-07T20:31:45.4978557Z 2025-05-07T20:31:45.4978726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4979254Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4979726Z module_map=module_map) 2025-05-07T20:31:45.4980109Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4980469Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4980733Z E ^ 2025-05-07T20:31:45.4981209Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4981670Z 2025-05-07T20:31:45.4982091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4982603Z 2025-05-07T20:31:45.4982710Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4983122Z self=, 2025-05-07T20:31:45.4983528Z T=4096, 2025-05-07T20:31:45.4983718Z D=5120, 2025-05-07T20:31:45.4983908Z scale_ub=1200.0, 2025-05-07T20:31:45.4984137Z contiguous=False, 2025-05-07T20:31:45.4984358Z compiled=True, 2025-05-07T20:31:45.4984564Z ) 2025-05-07T20:31:45.4984880Z self = 2025-05-07T20:31:45.4985369Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.4985642Z 2025-05-07T20:31:45.4985727Z @given( 2025-05-07T20:31:45.4985952Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4986261Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4986566Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4986887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4987219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4987507Z ) 2025-05-07T20:31:45.4987849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4988293Z def test_silu_mul_quant( 2025-05-07T20:31:45.4988534Z self, 2025-05-07T20:31:45.4988731Z T: int, 2025-05-07T20:31:45.4988928Z D: int, 2025-05-07T20:31:45.4989146Z scale_ub: Optional[float], 2025-05-07T20:31:45.4989418Z contiguous: bool, 2025-05-07T20:31:45.4989656Z compiled: bool, 2025-05-07T20:31:45.4989950Z ) -> None: 2025-05-07T20:31:45.4990162Z torch.manual_seed(2025) 2025-05-07T20:31:45.4990396Z 2025-05-07T20:31:45.4990667Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4991008Z 2025-05-07T20:31:45.4991199Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4991488Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4991795Z x = x_sign * x_clamp 2025-05-07T20:31:45.4992033Z x0 = x[:, :D] 2025-05-07T20:31:45.4992334Z x1 = x[:, D:] 2025-05-07T20:31:45.4992547Z 2025-05-07T20:31:45.4992744Z if contiguous: 2025-05-07T20:31:45.4992978Z x0 = x0.contiguous() 2025-05-07T20:31:45.4993234Z x1 = x1.contiguous() 2025-05-07T20:31:45.4993474Z 2025-05-07T20:31:45.4993661Z if scale_ub is not None: 2025-05-07T20:31:45.4994013Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4994342Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4994652Z ) 2025-05-07T20:31:45.4994847Z else: 2025-05-07T20:31:45.4995057Z scale_ub_tensor = None 2025-05-07T20:31:45.4995308Z 2025-05-07T20:31:45.4995536Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4995846Z op = silu_mul_quant 2025-05-07T20:31:45.4996099Z if compiled: 2025-05-07T20:31:45.4996343Z op = torch.compile(op) 2025-05-07T20:31:45.4996646Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4996919Z 2025-05-07T20:31:45.4997107Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4997275Z 2025-05-07T20:31:45.4997375Z moe/activation_test.py:117: 2025-05-07T20:31:45.4997667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4998003Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4998279Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4998831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4999387Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.5000037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5000720Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5001251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5001931Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5002582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5003114Z kernel = self.compile( 2025-05-07T20:31:45.5003654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5004475Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5004869Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5005103Z 2025-05-07T20:31:45.5005309Z self = 2025-05-07T20:31:45.5006392Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5007760Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02ebddc0>} 2025-05-07T20:31:45.5009101Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5010124Z context = 2025-05-07T20:31:45.5010409Z 2025-05-07T20:31:45.5010578Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5011100Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5011560Z module_map=module_map) 2025-05-07T20:31:45.5011927Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5012407Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5012667Z E ^ 2025-05-07T20:31:45.5013134Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5013581Z 2025-05-07T20:31:45.5014000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5014617Z 2025-05-07T20:31:45.7773702Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.7774286Z self=, 2025-05-07T20:31:45.7774878Z T=2048, 2025-05-07T20:31:45.7775123Z D=7168, 2025-05-07T20:31:45.7775370Z scale_ub=1200.0, 2025-05-07T20:31:45.7775669Z contiguous=False, 2025-05-07T20:31:45.7775962Z compiled=False, 2025-05-07T20:31:45.7776224Z ) 2025-05-07T20:31:45.7776643Z self = 2025-05-07T20:31:45.7777150Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.7777430Z 2025-05-07T20:31:45.7777510Z @given( 2025-05-07T20:31:45.7777737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.7778048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.7778350Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.7778683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.7779013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.7779292Z ) 2025-05-07T20:31:45.7779640Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.7780080Z def test_silu_mul_quant( 2025-05-07T20:31:45.7780318Z self, 2025-05-07T20:31:45.7780509Z T: int, 2025-05-07T20:31:45.7780700Z D: int, 2025-05-07T20:31:45.7780915Z scale_ub: Optional[float], 2025-05-07T20:31:45.7781186Z contiguous: bool, 2025-05-07T20:31:45.7781424Z compiled: bool, 2025-05-07T20:31:45.7781647Z ) -> None: 2025-05-07T20:31:45.7781868Z torch.manual_seed(2025) 2025-05-07T20:31:45.7782107Z 2025-05-07T20:31:45.7782371Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.7782711Z 2025-05-07T20:31:45.7782903Z x_sign = torch.sign(x) 2025-05-07T20:31:45.7783198Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.7783501Z x = x_sign * x_clamp 2025-05-07T20:31:45.7783740Z x0 = x[:, :D] 2025-05-07T20:31:45.7783953Z x1 = x[:, D:] 2025-05-07T20:31:45.7784153Z 2025-05-07T20:31:45.7784339Z if contiguous: 2025-05-07T20:31:45.7784571Z x0 = x0.contiguous() 2025-05-07T20:31:45.7784824Z x1 = x1.contiguous() 2025-05-07T20:31:45.7785062Z 2025-05-07T20:31:45.7785251Z if scale_ub is not None: 2025-05-07T20:31:45.7785519Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.7785859Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.7786169Z ) 2025-05-07T20:31:45.7786358Z else: 2025-05-07T20:31:45.7786568Z scale_ub_tensor = None 2025-05-07T20:31:45.7786822Z 2025-05-07T20:31:45.7787047Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.7787364Z op = silu_mul_quant 2025-05-07T20:31:45.7787616Z if compiled: 2025-05-07T20:31:45.7787860Z op = torch.compile(op) 2025-05-07T20:31:45.7788150Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.7788426Z 2025-05-07T20:31:45.7788616Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.7788783Z 2025-05-07T20:31:45.7788883Z moe/activation_test.py:117: 2025-05-07T20:31:45.7789177Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.7789512Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.7789786Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.7790720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.7791415Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.7791952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.7792731Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.7793387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.7793911Z kernel = self.compile( 2025-05-07T20:31:45.7794440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.7795089Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.7795481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.7795709Z 2025-05-07T20:31:45.7795930Z self = 2025-05-07T20:31:45.7797009Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.7798387Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa029de670>} 2025-05-07T20:31:45.7799734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.7800754Z context = 2025-05-07T20:31:45.7801038Z 2025-05-07T20:31:45.7801211Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.7801725Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.7802190Z module_map=module_map) 2025-05-07T20:31:45.7802551Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.7802899Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.7803160Z E ^ 2025-05-07T20:31:45.7803629Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.7804781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.7805395Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError, traceback identical to the one above
2025-05-07T20:31:45.7842728Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:45.9774846Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
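All of these failures share a single root cause: the Triton front end rejects the fp8e4nv element type (PyTorch's float8_e4m3fn) while lowering _fbgemm_silu_mul_quant, and on this GPU offers only fp8e4b15/fp8e5. As far as I know, fp8e4nv lowering is tied to SM 8.9+ (Ada/Hopper), while the A10G on a linux.g5.4xlarge runner reports compute capability 8.6. A minimal sketch of a capability guard that would skip the whole class once instead of failing every drawn example (the helper and class names are illustrative, not from the test file):

import unittest

import torch


def _fp8e4nv_supported() -> bool:
    # Assumption: Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+
    # (Ada/Hopper); the A10G on this runner reports (8, 6), which is why
    # Triton advertises only fp8e4b15/fp8e5 here.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Applied to the test class, this turns the repeated per-example compile
# failures into a single skip on unsupported GPUs.
@unittest.skipUnless(_fp8e4nv_supported(), "fp8e4nv requires SM 8.9+; skipping FP8 MoE tests")
class ActivationTests(unittest.TestCase):
    ...  # test_silu_mul_quant as listed in the traceback above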
2025-05-07T20:31:45.9805603Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:46.1027706Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:46.3122487Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
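Because the error is raised at kernel-compile time, every Hypothesis example is doomed the same way, so a standalone repro is faster for debugging than re-running the property test. A sketch, assuming silu_mul_quant is importable from the module path shown in the traceback, replaying one failing example's parameters:

# Minimal standalone repro (outside Hypothesis) of one failing example;
# parameters T=2048, D=7168 match the first example above.
import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

torch.manual_seed(2025)
T, D = 2048, 7168
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
x0, x1 = x[:, :D], x[:, D:]

# On a GPU without fp8e4nv support this raises
# triton.compiler.errors.CompilationError at kernel-compile time.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)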
2025-05-07T20:31:46.3153864Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:31:46.7415331Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:46.7447938Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:46.8692053Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
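For orientation while reading these tracebacks: the op under test fuses SiLU(x0) * x1 with FP8 quantization and returns the quantized tensor plus a scale. A rough eager-mode reference, assuming row-wise float8_e4m3fn scaling with an optional scale upper bound; the actual FBGEMM kernel's scaling scheme may differ:

from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Assumed semantics: SiLU(x0) * x1 in fp32, then row-wise quantization
    # to float8_e4m3fn such that y ~= y_fp8.float() * scale.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        # Matches the test's optional scale_ub_tensor argument.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    scale = row_max / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale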
2025-05-07T20:31:47.0983612Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError:
2025-05-07T20:31:47.1013628Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:47.1013982Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:47.1014236Z E       ^
2025-05-07T20:31:47.1014704Z E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:47.1015591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

[... Hypothesis then tries seven more examples, and every one fails with the identical CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), raised while compiling _fbgemm_silu_mul_quant and surfacing at moe/activation_test.py:117. The repeated test listing and traceback are verbatim copies of the example above (modulo object addresses) and are elided here; only the example parameters differ: ...]

2025-05-07T20:31:47.1016225Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:47.2373524Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:47.7221192Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:31:47.7263415Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:47.7293687Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:47.8497099Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:48.0256010Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
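[Editor's note on the repeated CompilationError: Triton rejects the fp8e4nv (e4m3) element type at kernel-compile time because this runner's GPU predates hardware FP8 support. The g5 runner's A10G is SM 8.6, while Triton accepts fp8e4nv only on SM 8.9+ (Ada/Hopper), which matches the error's claim that only fp8e4b15 and fp8e5 are available here. A minimal sketch of a capability guard a suite like this could use to skip rather than fail on such hardware; the helper and marker names are illustrative, not part of activation_test.py:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) compiles in Triton only on NVIDIA SM 8.9+;
        # older parts such as the A10G (SM 8.6) only expose
        # fp8e4b15 and fp8e5, exactly as the error message reports.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to test_silu_mul_quant, this would skip the test up front
    # instead of failing every Hypothesis example the same way.
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv (e4m3) requires compute capability >= 8.9",
    )

Skipping at collection time would also avoid the memory pressure visible in the examples that follow.]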
2025-05-07T20:31:48.0293981Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

[... test body identical to the listing above; this example fails earlier, while building its bfloat16 inputs: ...]

2025-05-07T20:31:48.0302540Z x_sign = torch.sign(x)
2025-05-07T20:31:48.0302823Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:48.0305045Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.0307060Z moe/activation_test.py:95: OutOfMemoryError

[... four more examples hit torch.OutOfMemoryError while building their inputs, with the same allocator message as above; only the requested size and the failing line differ: ...]

2025-05-07T20:31:48.0307380Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- OutOfMemoryError allocating 112.00 MiB at x_clamp (moe/activation_test.py:95)
2025-05-07T20:31:48.0321881Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -- OutOfMemoryError allocating 448.00 MiB at torch.randn (moe/activation_test.py:92)
2025-05-07T20:31:48.1372730Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- OutOfMemoryError allocating 56.00 MiB at x_clamp (moe/activation_test.py:95)
2025-05-07T20:31:48.1385953Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -- OutOfMemoryError allocating 56.00 MiB at x_sign (moe/activation_test.py:94)
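[Editor's note on the OutOfMemoryError run: these failures look like a knock-on effect of the compilation failures above. Hypothesis keeps the worker process alive across examples, so the large bfloat16 inputs from earlier examples stay cached by PyTorch's allocator until even a 56 MiB request fails on the 22 GiB device. The allocator message itself suggests one mitigation; a sketch of that plus an explicit cache release between examples, with an illustrative helper name:

    import gc
    import os

    # Suggested by the allocator message; must be set before the process
    # makes its first CUDA allocation in order to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cached_gpu_memory() -> None:
        # Drop dead Python references, wait for in-flight work, then
        # return cached allocator blocks so the next Hypothesis example
        # starts from a clean memory pool.
        gc.collect()
        torch.cuda.synchronize()
        torch.cuda.empty_cache()

Calling such a helper from a per-example teardown would reduce the fragmentation-driven OOMs, though the underlying CompilationError would remain.]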
2025-05-07T20:31:48.1398834Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

[... with T=1 the inputs fit, so this example reaches the kernel launch and fails like the earlier ones, at moe/activation_test.py:117: triton.compiler.errors.CompilationError wrapping ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]

2025-05-07T20:31:48.2998457Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)

[... the log section is truncated partway through this example's traceback, at the silu_mul_quant frame in fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 ...]
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.3015704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.3016370Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.3017020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.3017536Z kernel = self.compile( 2025-05-07T20:31:48.3018064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.3018709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.3019100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.3019327Z 2025-05-07T20:31:48.3019533Z self = 2025-05-07T20:31:48.3020605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.3022090Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02227040>} 2025-05-07T20:31:48.3023482Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.3024492Z context = 2025-05-07T20:31:48.3024849Z 2025-05-07T20:31:48.3025016Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.3025523Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.3025983Z module_map=module_map) 2025-05-07T20:31:48.3026340Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.3026681Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.3026932Z E ^ 2025-05-07T20:31:48.3027400Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.3027844Z 2025-05-07T20:31:48.3028259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.3028762Z 2025-05-07T20:31:48.3028860Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.3029266Z self=, 2025-05-07T20:31:48.3029664Z T=128, 2025-05-07T20:31:48.3029887Z D=7168, 2025-05-07T20:31:48.3030070Z scale_ub=None, 2025-05-07T20:31:48.3030273Z contiguous=True, 2025-05-07T20:31:48.3030485Z compiled=False, 2025-05-07T20:31:48.3030688Z ) 2025-05-07T20:31:48.3922533Z self = 2025-05-07T20:31:48.3923049Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.3923359Z 2025-05-07T20:31:48.3923447Z @given( 2025-05-07T20:31:48.3923674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.3923979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.3924279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.3924600Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.3924919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.3925203Z ) 2025-05-07T20:31:48.3925540Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.3925969Z def test_silu_mul_quant( 2025-05-07T20:31:48.3926208Z self, 2025-05-07T20:31:48.3926391Z T: int, 2025-05-07T20:31:48.3926582Z D: int, 2025-05-07T20:31:48.3926794Z scale_ub: Optional[float], 2025-05-07T20:31:48.3927060Z contiguous: bool, 2025-05-07T20:31:48.3927284Z compiled: bool, 2025-05-07T20:31:48.3927502Z ) -> None: 2025-05-07T20:31:48.3927708Z torch.manual_seed(2025) 2025-05-07T20:31:48.3927937Z 2025-05-07T20:31:48.3928207Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.3928541Z 2025-05-07T20:31:48.3928722Z x_sign = torch.sign(x) 2025-05-07T20:31:48.3929006Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.3929306Z x = x_sign * x_clamp 2025-05-07T20:31:48.3929533Z x0 = x[:, :D] 2025-05-07T20:31:48.3929749Z x1 = x[:, D:] 2025-05-07T20:31:48.3929969Z 2025-05-07T20:31:48.3930145Z if contiguous: 2025-05-07T20:31:48.3930363Z x0 = x0.contiguous() 2025-05-07T20:31:48.3930620Z x1 = x1.contiguous() 2025-05-07T20:31:48.3930853Z 2025-05-07T20:31:48.3931034Z if scale_ub is not None: 2025-05-07T20:31:48.3931294Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.3938082Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.3938440Z ) 2025-05-07T20:31:48.3938626Z else: 2025-05-07T20:31:48.3938834Z scale_ub_tensor = None 2025-05-07T20:31:48.3939258Z 2025-05-07T20:31:48.3939490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.3939804Z op = silu_mul_quant 2025-05-07T20:31:48.3940061Z if compiled: 2025-05-07T20:31:48.3940305Z op = torch.compile(op) 2025-05-07T20:31:48.3940607Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.3940992Z 2025-05-07T20:31:48.3941177Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.3941346Z 2025-05-07T20:31:48.3941446Z moe/activation_test.py:117: 2025-05-07T20:31:48.3941740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.3942072Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.3942346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.3943109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.3943814Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.3944366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.3945057Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.3945731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.3946274Z kernel = self.compile( 2025-05-07T20:31:48.3946819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.3947482Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.3947884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.3948112Z 2025-05-07T20:31:48.3948327Z self = 2025-05-07T20:31:48.3949436Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.3950926Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02227c10>} 2025-05-07T20:31:48.3952301Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.3953390Z context = 2025-05-07T20:31:48.3953683Z 2025-05-07T20:31:48.3953854Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.3954379Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.3954852Z module_map=module_map) 2025-05-07T20:31:48.3955215Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.3955560Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.3955818Z E ^ 2025-05-07T20:31:48.3956286Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.3956748Z 2025-05-07T20:31:48.3957179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.3957702Z 2025-05-07T20:31:48.3957801Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.3958226Z self=, 2025-05-07T20:31:48.3958634Z T=2048, 2025-05-07T20:31:48.3958811Z D=7168, 2025-05-07T20:31:48.3958993Z scale_ub=1200.0, 2025-05-07T20:31:48.3959205Z contiguous=True, 2025-05-07T20:31:48.3959419Z compiled=False, 2025-05-07T20:31:48.3959621Z ) 2025-05-07T20:31:48.3960028Z self = 2025-05-07T20:31:48.3960524Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.3960810Z 2025-05-07T20:31:48.3960884Z @given( 2025-05-07T20:31:48.3961111Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.3961522Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.3961831Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.3962164Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.3962494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.3962774Z ) 2025-05-07T20:31:48.3963120Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.3963562Z def test_silu_mul_quant( 2025-05-07T20:31:48.3963796Z self, 2025-05-07T20:31:48.3963986Z T: int, 2025-05-07T20:31:48.3964174Z D: int, 2025-05-07T20:31:48.3964394Z scale_ub: Optional[float], 2025-05-07T20:31:48.3964663Z contiguous: bool, 2025-05-07T20:31:48.3964899Z compiled: bool, 2025-05-07T20:31:48.3965125Z ) -> None: 2025-05-07T20:31:48.3965330Z torch.manual_seed(2025) 2025-05-07T20:31:48.3965563Z 2025-05-07T20:31:48.3965831Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.3967953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
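[editor's note] The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. That variable is read when the CUDA caching allocator initializes, so it must be set before the first CUDA allocation in the process; in CI the simplest place is the workflow step's environment. A sketch of the in-process alternative:

    # Must run before torch first touches the GPU in this process.
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    import torch  # the import and all .cuda allocations come after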
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.3969873Z 2025-05-07T20:31:48.3969998Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.3970211Z 2025-05-07T20:31:48.3970312Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.3970724Z self=, 2025-05-07T20:31:48.3971126Z T=1, 2025-05-07T20:31:48.3971301Z D=5120, 2025-05-07T20:31:48.3971487Z scale_ub=1200.0, 2025-05-07T20:31:48.3971703Z contiguous=True, 2025-05-07T20:31:48.3971918Z compiled=False, 2025-05-07T20:31:48.3972112Z ) 2025-05-07T20:31:48.4455902Z self = 2025-05-07T20:31:48.4456409Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.4456704Z 2025-05-07T20:31:48.4456799Z @given( 2025-05-07T20:31:48.4457125Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.4457548Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.4457903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.4458236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.4458562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.4458841Z ) 2025-05-07T20:31:48.4459190Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.4459638Z def test_silu_mul_quant( 2025-05-07T20:31:48.4459875Z self, 2025-05-07T20:31:48.4460067Z T: int, 2025-05-07T20:31:48.4460271Z D: int, 2025-05-07T20:31:48.4460485Z scale_ub: Optional[float], 2025-05-07T20:31:48.4460765Z contiguous: bool, 2025-05-07T20:31:48.4461004Z compiled: bool, 2025-05-07T20:31:48.4461223Z ) -> None: 2025-05-07T20:31:48.4461438Z torch.manual_seed(2025) 2025-05-07T20:31:48.4461682Z 2025-05-07T20:31:48.4461951Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.4462290Z 2025-05-07T20:31:48.4462642Z x_sign = torch.sign(x) 2025-05-07T20:31:48.4462941Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.4463284Z x = x_sign * x_clamp 2025-05-07T20:31:48.4463536Z x0 = x[:, :D] 2025-05-07T20:31:48.4463754Z x1 = x[:, D:] 2025-05-07T20:31:48.4463961Z 2025-05-07T20:31:48.4464156Z if contiguous: 2025-05-07T20:31:48.4464512Z x0 = x0.contiguous() 2025-05-07T20:31:48.4464770Z x1 = x1.contiguous() 2025-05-07T20:31:48.4465013Z 2025-05-07T20:31:48.4465204Z if scale_ub is not None: 2025-05-07T20:31:48.4465469Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.4465809Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.4466121Z ) 2025-05-07T20:31:48.4466309Z else: 2025-05-07T20:31:48.4466522Z scale_ub_tensor = None 2025-05-07T20:31:48.4466771Z 2025-05-07T20:31:48.4466999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.4467316Z op = silu_mul_quant 2025-05-07T20:31:48.4467568Z if compiled: 2025-05-07T20:31:48.4467811Z op = torch.compile(op) 2025-05-07T20:31:48.4468103Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.4468377Z 2025-05-07T20:31:48.4468573Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.4468744Z 2025-05-07T20:31:48.4468842Z moe/activation_test.py:117: 2025-05-07T20:31:48.4469139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.4469478Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.4469756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.4470526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.4471212Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.4471744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.4472424Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.4473090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.4473619Z kernel = self.compile( 2025-05-07T20:31:48.4474163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.4474813Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.4475207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.4475433Z 2025-05-07T20:31:48.4475644Z self = 2025-05-07T20:31:48.4476720Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.4478103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa021ad9d0>} 2025-05-07T20:31:48.4479448Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.4480471Z context = 2025-05-07T20:31:48.4480757Z 2025-05-07T20:31:48.4480928Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.4481446Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.4481910Z module_map=module_map) 2025-05-07T20:31:48.4482273Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.4482706Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.4482969Z E ^ 2025-05-07T20:31:48.4483436Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.4483885Z 2025-05-07T20:31:48.4484302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.4484910Z 2025-05-07T20:31:48.4485014Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.4485425Z self=, 2025-05-07T20:31:48.4485827Z T=2048, 2025-05-07T20:31:48.4486013Z D=5120, 2025-05-07T20:31:48.4486203Z scale_ub=None, 2025-05-07T20:31:48.4486416Z contiguous=True, 2025-05-07T20:31:48.4486661Z compiled=False, 2025-05-07T20:31:48.4486864Z ) 2025-05-07T20:31:48.4487178Z self = 2025-05-07T20:31:48.4487675Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.4487944Z 2025-05-07T20:31:48.4488034Z @given( 2025-05-07T20:31:48.4488257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.4488569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.4488873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.4489208Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.4489548Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.4489829Z ) 2025-05-07T20:31:48.4490173Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.4490619Z def test_silu_mul_quant( 2025-05-07T20:31:48.4490860Z self, 2025-05-07T20:31:48.4491055Z T: int, 2025-05-07T20:31:48.4491250Z D: int, 2025-05-07T20:31:48.4491465Z scale_ub: Optional[float], 2025-05-07T20:31:48.4491734Z contiguous: bool, 2025-05-07T20:31:48.4491974Z compiled: bool, 2025-05-07T20:31:48.4492198Z ) -> None: 2025-05-07T20:31:48.4492417Z torch.manual_seed(2025) 2025-05-07T20:31:48.4492650Z 2025-05-07T20:31:48.4492922Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.4493261Z 2025-05-07T20:31:48.4493453Z > x_sign = torch.sign(x) 2025-05-07T20:31:48.4495416Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
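[editor's note] The CompilationError blocks interleaved with these OOM failures are a second, independent problem: Triton rejects the fp8e4nv (FP8 E4M3) dtype on this runner because the g5 instance's A10G GPU is compute capability 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports; fp8e4nv codegen generally needs SM 8.9+ (Ada/Hopper). A sketch of a capability guard that would skip these cases instead of erroring (guard and class names are illustrative):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # FP8 E4M3 Triton kernels generally require SM 8.9 or newer.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class Fp8ActivationTests(unittest.TestCase):
        ...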
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.4497280Z 2025-05-07T20:31:48.4497406Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:48.4497621Z 2025-05-07T20:31:48.4497723Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.4498152Z self=, 2025-05-07T20:31:48.4498551Z T=16384, 2025-05-07T20:31:48.4498744Z D=5120, 2025-05-07T20:31:48.4498945Z scale_ub=None, 2025-05-07T20:31:48.4499155Z contiguous=True, 2025-05-07T20:31:48.4499378Z compiled=False, 2025-05-07T20:31:48.4499580Z ) 2025-05-07T20:31:48.4499895Z self = 2025-05-07T20:31:48.4500386Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.4500660Z 2025-05-07T20:31:48.4500746Z @given( 2025-05-07T20:31:48.4500974Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.4501278Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.4501585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.4502002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.4502324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.4502607Z ) 2025-05-07T20:31:48.4502954Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.4503387Z def test_silu_mul_quant( 2025-05-07T20:31:48.4503878Z self, 2025-05-07T20:31:48.4504074Z T: int, 2025-05-07T20:31:48.4504264Z D: int, 2025-05-07T20:31:48.4504479Z scale_ub: Optional[float], 2025-05-07T20:31:48.4504746Z contiguous: bool, 2025-05-07T20:31:48.4504977Z compiled: bool, 2025-05-07T20:31:48.4505203Z ) -> None: 2025-05-07T20:31:48.4505417Z torch.manual_seed(2025) 2025-05-07T20:31:48.4505653Z 2025-05-07T20:31:48.4505925Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.4507980Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.4509889Z 2025-05-07T20:31:48.4510009Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.4510219Z 2025-05-07T20:31:48.4510325Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.4510732Z self=, 2025-05-07T20:31:48.4511129Z T=4096, 2025-05-07T20:31:48.4511313Z D=5120, 2025-05-07T20:31:48.4511498Z scale_ub=None, 2025-05-07T20:31:48.4511715Z contiguous=True, 2025-05-07T20:31:48.4511936Z compiled=False, 2025-05-07T20:31:48.4512145Z ) 2025-05-07T20:31:48.5550521Z self = 2025-05-07T20:31:48.5551075Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.5551464Z 2025-05-07T20:31:48.5551574Z @given( 2025-05-07T20:31:48.5551888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5552198Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5552503Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5552830Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5553159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5553443Z ) 2025-05-07T20:31:48.5553783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5554230Z def test_silu_mul_quant( 2025-05-07T20:31:48.5554480Z self, 2025-05-07T20:31:48.5554670Z T: int, 2025-05-07T20:31:48.5554874Z D: int, 2025-05-07T20:31:48.5555093Z scale_ub: Optional[float], 2025-05-07T20:31:48.5555362Z contiguous: bool, 2025-05-07T20:31:48.5555606Z compiled: bool, 2025-05-07T20:31:48.5555830Z ) -> None: 2025-05-07T20:31:48.5556040Z torch.manual_seed(2025) 2025-05-07T20:31:48.5556289Z 2025-05-07T20:31:48.5556564Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5558610Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5560617Z 2025-05-07T20:31:48.5560749Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.5560964Z 2025-05-07T20:31:48.5561065Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5561484Z self=, 2025-05-07T20:31:48.5562002Z T=2048, 2025-05-07T20:31:48.5562184Z D=5120, 2025-05-07T20:31:48.5562374Z scale_ub=None, 2025-05-07T20:31:48.5562588Z contiguous=False, 2025-05-07T20:31:48.5562813Z compiled=False, 2025-05-07T20:31:48.5563022Z ) 2025-05-07T20:31:48.5563336Z self = 2025-05-07T20:31:48.5563827Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:48.5564109Z 2025-05-07T20:31:48.5564189Z @given( 2025-05-07T20:31:48.5564419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5564741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5565046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5565374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5565699Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5565984Z ) 2025-05-07T20:31:48.5566330Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5566773Z def test_silu_mul_quant( 2025-05-07T20:31:48.5567012Z self, 2025-05-07T20:31:48.5567207Z T: int, 2025-05-07T20:31:48.5567403Z D: int, 2025-05-07T20:31:48.5567617Z scale_ub: Optional[float], 2025-05-07T20:31:48.5567889Z contiguous: bool, 2025-05-07T20:31:48.5568128Z compiled: bool, 2025-05-07T20:31:48.5568355Z ) -> None: 2025-05-07T20:31:48.5568567Z torch.manual_seed(2025) 2025-05-07T20:31:48.5568808Z 2025-05-07T20:31:48.5569078Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5571135Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
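[editor's note] The contiguous parameter in the test body shown above also affects memory pressure: x0 = x[:, :D] and x1 = x[:, D:] are strided views of x, and calling .contiguous() on them materializes copies, adding another 2*T*D bfloat16 elements on an already-full device. A small self-contained illustration:

    # Column slices are strided views; .contiguous() allocates a copy.
    import torch
    x = torch.randn(4, 8)
    x1 = x[:, 4:]
    assert not x1.is_contiguous()
    assert x1.contiguous().is_contiguous()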
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5572992Z 2025-05-07T20:31:48.5573111Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.5573321Z 2025-05-07T20:31:48.5573423Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5573839Z self=, 2025-05-07T20:31:48.5574237Z T=4096, 2025-05-07T20:31:48.5574421Z D=7168, 2025-05-07T20:31:48.5574613Z scale_ub=None, 2025-05-07T20:31:48.5574829Z contiguous=True, 2025-05-07T20:31:48.5575052Z compiled=True, 2025-05-07T20:31:48.5575259Z ) 2025-05-07T20:31:48.5575575Z self = 2025-05-07T20:31:48.5576062Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5576333Z 2025-05-07T20:31:48.5576410Z @given( 2025-05-07T20:31:48.5576638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5576954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5577252Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5577665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5578023Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5578296Z ) 2025-05-07T20:31:48.5578635Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5579071Z def test_silu_mul_quant( 2025-05-07T20:31:48.5579311Z self, 2025-05-07T20:31:48.5579586Z T: int, 2025-05-07T20:31:48.5579779Z D: int, 2025-05-07T20:31:48.5579995Z scale_ub: Optional[float], 2025-05-07T20:31:48.5580255Z contiguous: bool, 2025-05-07T20:31:48.5580488Z compiled: bool, 2025-05-07T20:31:48.5580705Z ) -> None: 2025-05-07T20:31:48.5580988Z torch.manual_seed(2025) 2025-05-07T20:31:48.5581227Z 2025-05-07T20:31:48.5581495Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5583526Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
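[editor's note] The "Tried to allocate" sizes track the input shape exactly: x is [T, 2*D] in bfloat16 (2 bytes per element), so T=4096, D=7168 needs 4096 * 14336 * 2 bytes = 112.00 MiB, matching the failure above. A quick check:

    # Allocation size of x = torch.randn([T, 2*D], dtype=torch.bfloat16).
    def x_mib(t: int, d: int) -> float:
        return t * 2 * d * 2 / 2**20

    assert x_mib(4096, 7168) == 112.0   # the 112.00 MiB failures
    assert x_mib(16384, 7168) == 448.0  # the 448.00 MiB failures
    assert x_mib(2048, 5120) == 40.0    # the 40.00 MiB failures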
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5585367Z 2025-05-07T20:31:48.5585487Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.5585692Z 2025-05-07T20:31:48.5585793Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5586206Z self=, 2025-05-07T20:31:48.5586603Z T=2048, 2025-05-07T20:31:48.5586779Z D=5120, 2025-05-07T20:31:48.5586961Z scale_ub=1200.0, 2025-05-07T20:31:48.5587186Z contiguous=False, 2025-05-07T20:31:48.5587402Z compiled=False, 2025-05-07T20:31:48.5587602Z ) 2025-05-07T20:31:48.5587915Z self = 2025-05-07T20:31:48.5588400Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.5588684Z 2025-05-07T20:31:48.5588759Z @given( 2025-05-07T20:31:48.5588980Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5589291Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5589587Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5590013Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5590336Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5590623Z ) 2025-05-07T20:31:48.5590962Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5591395Z def test_silu_mul_quant( 2025-05-07T20:31:48.5591629Z self, 2025-05-07T20:31:48.5591821Z T: int, 2025-05-07T20:31:48.5592017Z D: int, 2025-05-07T20:31:48.5592223Z scale_ub: Optional[float], 2025-05-07T20:31:48.5592484Z contiguous: bool, 2025-05-07T20:31:48.5592718Z compiled: bool, 2025-05-07T20:31:48.5592930Z ) -> None: 2025-05-07T20:31:48.5593145Z torch.manual_seed(2025) 2025-05-07T20:31:48.5593381Z 2025-05-07T20:31:48.5593653Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5595670Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5597517Z 2025-05-07T20:31:48.5597634Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.5597847Z 2025-05-07T20:31:48.5597948Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5598354Z self=, 2025-05-07T20:31:48.5598748Z T=4096, 2025-05-07T20:31:48.5598935Z D=7168, 2025-05-07T20:31:48.5599233Z scale_ub=1200.0, 2025-05-07T20:31:48.5599454Z contiguous=True, 2025-05-07T20:31:48.5599664Z compiled=False, 2025-05-07T20:31:48.5599859Z ) 2025-05-07T20:31:48.5600167Z self = 2025-05-07T20:31:48.5600646Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.5600996Z 2025-05-07T20:31:48.5601073Z @given( 2025-05-07T20:31:48.5601296Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5601598Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5601895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5602217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5602539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5602819Z ) 2025-05-07T20:31:48.5603157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5603595Z def test_silu_mul_quant( 2025-05-07T20:31:48.5604096Z self, 2025-05-07T20:31:48.5604284Z T: int, 2025-05-07T20:31:48.5604482Z D: int, 2025-05-07T20:31:48.5604693Z scale_ub: Optional[float], 2025-05-07T20:31:48.5604958Z contiguous: bool, 2025-05-07T20:31:48.5605202Z compiled: bool, 2025-05-07T20:31:48.5605437Z ) -> None: 2025-05-07T20:31:48.5612221Z torch.manual_seed(2025) 2025-05-07T20:31:48.5612493Z 2025-05-07T20:31:48.5612769Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5614817Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
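[editor's note] Each drawn example is logged here because the suite runs Hypothesis at verbosity=Verbosity.verbose; deadline=None disables per-example time limits, and max_examples=_MAX_SAMPLES caps how many parameter combinations are drawn. A stripped-down sketch of that harness with the imports it needs (_MAX_SAMPLES below is a placeholder for the constant defined in the test module):

    from hypothesis import Verbosity, given, settings, strategies as st

    _MAX_SAMPLES = 10  # placeholder for the test module's own value

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_sketch(T: int) -> None:
        ...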
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5616683Z 2025-05-07T20:31:48.5616801Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.5617013Z 2025-05-07T20:31:48.5617125Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5617531Z self=, 2025-05-07T20:31:48.5617926Z T=16384, 2025-05-07T20:31:48.5618121Z D=7168, 2025-05-07T20:31:48.5618372Z scale_ub=None, 2025-05-07T20:31:48.5618695Z contiguous=False, 2025-05-07T20:31:48.5618978Z compiled=True, 2025-05-07T20:31:48.5619243Z ) 2025-05-07T20:31:48.6913358Z self = 2025-05-07T20:31:48.6913900Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:48.6914293Z 2025-05-07T20:31:48.6914401Z @given( 2025-05-07T20:31:48.6914666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6914972Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6915278Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6915609Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6915939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6916227Z ) 2025-05-07T20:31:48.6916575Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6917008Z def test_silu_mul_quant( 2025-05-07T20:31:48.6917252Z self, 2025-05-07T20:31:48.6917447Z T: int, 2025-05-07T20:31:48.6917639Z D: int, 2025-05-07T20:31:48.6917856Z scale_ub: Optional[float], 2025-05-07T20:31:48.6918127Z contiguous: bool, 2025-05-07T20:31:48.6918359Z compiled: bool, 2025-05-07T20:31:48.6918591Z ) -> None: 2025-05-07T20:31:48.6918809Z torch.manual_seed(2025) 2025-05-07T20:31:48.6919223Z 2025-05-07T20:31:48.6919495Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6921541Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6923564Z 2025-05-07T20:31:48.6923684Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.6923894Z 2025-05-07T20:31:48.6924002Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6924418Z self=, 2025-05-07T20:31:48.6924827Z T=4096, 2025-05-07T20:31:48.6925013Z D=7168, 2025-05-07T20:31:48.6925200Z scale_ub=None, 2025-05-07T20:31:48.6925408Z contiguous=True, 2025-05-07T20:31:48.6925627Z compiled=False, 2025-05-07T20:31:48.6925832Z ) 2025-05-07T20:31:48.6926140Z self = 2025-05-07T20:31:48.6926635Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.6926903Z 2025-05-07T20:31:48.6926990Z @given( 2025-05-07T20:31:48.6927211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6927518Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6927825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6928149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6928474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6928753Z ) 2025-05-07T20:31:48.6929100Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6929533Z def test_silu_mul_quant( 2025-05-07T20:31:48.6929773Z self, 2025-05-07T20:31:48.6929965Z T: int, 2025-05-07T20:31:48.6930161Z D: int, 2025-05-07T20:31:48.6930377Z scale_ub: Optional[float], 2025-05-07T20:31:48.6930657Z contiguous: bool, 2025-05-07T20:31:48.6930891Z compiled: bool, 2025-05-07T20:31:48.6931114Z ) -> None: 2025-05-07T20:31:48.6931325Z torch.manual_seed(2025) 2025-05-07T20:31:48.6931567Z 2025-05-07T20:31:48.6931831Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6933864Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6935716Z 2025-05-07T20:31:48.6935833Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.6936046Z 2025-05-07T20:31:48.6936152Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6936560Z self=, 2025-05-07T20:31:48.6936957Z T=16384, 2025-05-07T20:31:48.6937143Z D=7168, 2025-05-07T20:31:48.6937333Z scale_ub=None, 2025-05-07T20:31:48.6937541Z contiguous=True, 2025-05-07T20:31:48.6937764Z compiled=False, 2025-05-07T20:31:48.6937964Z ) 2025-05-07T20:31:48.6938270Z self = 2025-05-07T20:31:48.6938754Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.6939110Z 2025-05-07T20:31:48.6939191Z @given( 2025-05-07T20:31:48.6939411Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6939721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6940021Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6940414Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6940739Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6941023Z ) 2025-05-07T20:31:48.6941364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6941793Z def test_silu_mul_quant( 2025-05-07T20:31:48.6942035Z self, 2025-05-07T20:31:48.6942228Z T: int, 2025-05-07T20:31:48.6942418Z D: int, 2025-05-07T20:31:48.6942635Z scale_ub: Optional[float], 2025-05-07T20:31:48.6942905Z contiguous: bool, 2025-05-07T20:31:48.6943167Z compiled: bool, 2025-05-07T20:31:48.6943406Z ) -> None: 2025-05-07T20:31:48.6943622Z torch.manual_seed(2025) 2025-05-07T20:31:48.6943857Z 2025-05-07T20:31:48.6944121Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6946171Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
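[editor's note] For orientation, the op under test comes from fbgemm_gpu.experimental.gen_ai.moe.activation and is called as silu_mul_quant(x0, x1, scale_ub_tensor), returning (y_fp8, y_scale). A rough eager-mode sketch of what such a fused op plausibly computes, assuming silu(x0) * x1 followed by rowwise FP8 E4M3 quantization; this is an illustrative assumption, not the kernel's actual definition:

    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

    def silu_mul_quant_reference(x0, x1, scale_ub=None):
        # Assumed semantics: fused silu-mul, then rowwise fp8 scaling.
        # scale_ub, if given, is a 1-element float32 tensor as in the test.
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        return (y / scale).to(torch.float8_e4m3fn), scale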
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6948049Z 2025-05-07T20:31:48.6948165Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.6948372Z 2025-05-07T20:31:48.6948480Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6948892Z self=, 2025-05-07T20:31:48.6949288Z T=16384, 2025-05-07T20:31:48.6949475Z D=7168, 2025-05-07T20:31:48.6949659Z scale_ub=1200.0, 2025-05-07T20:31:48.6949974Z contiguous=True, 2025-05-07T20:31:48.6950193Z compiled=False, 2025-05-07T20:31:48.6950394Z ) 2025-05-07T20:31:48.6950703Z self = 2025-05-07T20:31:48.6951189Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.6951461Z 2025-05-07T20:31:48.6951543Z @given( 2025-05-07T20:31:48.6951764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6952074Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6952375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6952693Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6953022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6953335Z ) 2025-05-07T20:31:48.6953696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6954130Z def test_silu_mul_quant( 2025-05-07T20:31:48.6954368Z self, 2025-05-07T20:31:48.6954556Z T: int, 2025-05-07T20:31:48.6954749Z D: int, 2025-05-07T20:31:48.6954965Z scale_ub: Optional[float], 2025-05-07T20:31:48.6955227Z contiguous: bool, 2025-05-07T20:31:48.6955465Z compiled: bool, 2025-05-07T20:31:48.6955684Z ) -> None: 2025-05-07T20:31:48.6955901Z torch.manual_seed(2025) 2025-05-07T20:31:48.6956136Z 2025-05-07T20:31:48.6956398Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6958542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6960492Z 2025-05-07T20:31:48.6960612Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.6960822Z 2025-05-07T20:31:48.6960935Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6961345Z self=, 2025-05-07T20:31:48.6961740Z T=128, 2025-05-07T20:31:48.6961927Z D=5120, 2025-05-07T20:31:48.6962111Z scale_ub=1200.0, 2025-05-07T20:31:48.6962334Z contiguous=False, 2025-05-07T20:31:48.6962557Z compiled=False, 2025-05-07T20:31:48.6962759Z ) 2025-05-07T20:31:49.0713394Z self = 2025-05-07T20:31:49.0713953Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.0714234Z 2025-05-07T20:31:49.0714314Z @given( 2025-05-07T20:31:49.0714547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.0714858Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.0715164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.0715495Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.0715825Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.0716102Z ) 2025-05-07T20:31:49.0716451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.0716891Z def test_silu_mul_quant( 2025-05-07T20:31:49.0717130Z self, 2025-05-07T20:31:49.0717324Z T: int, 2025-05-07T20:31:49.0717522Z D: int, 2025-05-07T20:31:49.0717741Z scale_ub: Optional[float], 2025-05-07T20:31:49.0718009Z contiguous: bool, 2025-05-07T20:31:49.0718255Z compiled: bool, 2025-05-07T20:31:49.0718482Z ) -> None: 2025-05-07T20:31:49.0718695Z torch.manual_seed(2025) 2025-05-07T20:31:49.0718936Z 2025-05-07T20:31:49.0719208Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.0719546Z 2025-05-07T20:31:49.0719744Z x_sign = torch.sign(x) 2025-05-07T20:31:49.0720033Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.0720340Z x = x_sign * x_clamp 2025-05-07T20:31:49.0720580Z x0 = x[:, :D] 2025-05-07T20:31:49.0720798Z x1 = x[:, D:] 2025-05-07T20:31:49.0721000Z 2025-05-07T20:31:49.0721186Z if contiguous: 2025-05-07T20:31:49.0721418Z x0 = x0.contiguous() 2025-05-07T20:31:49.0721673Z x1 = x1.contiguous() 2025-05-07T20:31:49.0721912Z 2025-05-07T20:31:49.0722106Z if scale_ub is not None: 2025-05-07T20:31:49.0722373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.0722715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.0723027Z ) 2025-05-07T20:31:49.0723222Z else: 2025-05-07T20:31:49.0723427Z scale_ub_tensor = None 2025-05-07T20:31:49.0723681Z 2025-05-07T20:31:49.0723914Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.0724226Z op = silu_mul_quant 2025-05-07T20:31:49.0724476Z if compiled: 2025-05-07T20:31:49.0724725Z op = torch.compile(op) 2025-05-07T20:31:49.0725016Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.0725296Z 2025-05-07T20:31:49.0725488Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.0725652Z 2025-05-07T20:31:49.0725753Z moe/activation_test.py:117: 2025-05-07T20:31:49.0726054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.0726385Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.0726669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.0727509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.0728218Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.0728755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.0729578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.0730240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.0730771Z kernel = self.compile( 2025-05-07T20:31:49.0731309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.0731958Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.0732356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.0732588Z 2025-05-07T20:31:49.0732798Z self = 2025-05-07T20:31:49.0733933Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.0735315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa5b1e7d670>} 2025-05-07T20:31:49.0736659Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.0737684Z context = 2025-05-07T20:31:49.0737969Z 2025-05-07T20:31:49.0738143Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.0738661Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.0739126Z module_map=module_map) 2025-05-07T20:31:49.0739488Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.0739849Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.0740104Z E ^ 2025-05-07T20:31:49.0740571Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.0741021Z 2025-05-07T20:31:49.0741438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.0741947Z 2025-05-07T20:31:49.0742050Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.0742462Z self=, 2025-05-07T20:31:49.0742861Z T=2048, 2025-05-07T20:31:49.0743052Z D=7168, 2025-05-07T20:31:49.0743251Z scale_ub=None, 2025-05-07T20:31:49.0743507Z contiguous=False, 2025-05-07T20:31:49.0743732Z compiled=False, 2025-05-07T20:31:49.0743930Z ) 2025-05-07T20:31:49.0744244Z self = 2025-05-07T20:31:49.0744745Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.0745016Z 2025-05-07T20:31:49.0745093Z @given( 2025-05-07T20:31:49.0745322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.0745634Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.0745936Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.0746267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.0746597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.0746895Z ) 2025-05-07T20:31:49.0747320Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.0747759Z def test_silu_mul_quant( 2025-05-07T20:31:49.0748000Z self, 2025-05-07T20:31:49.0748189Z T: int, 2025-05-07T20:31:49.0748391Z D: int, 2025-05-07T20:31:49.0748604Z scale_ub: Optional[float], 2025-05-07T20:31:49.0748879Z contiguous: bool, 2025-05-07T20:31:49.0749195Z compiled: bool, 2025-05-07T20:31:49.0749412Z ) -> None: 2025-05-07T20:31:49.0749626Z torch.manual_seed(2025) 2025-05-07T20:31:49.0749953Z 2025-05-07T20:31:49.0750220Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.0752287Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
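[editor's note] The next failing example below is the first to reach the kernel with compiled=True; its traceback differs only by an extra torch/_dynamo/eval_frame.py frame. torch.compile changes how the Python wrapper is invoked, but as the traceback shows, the wrapper still launches the same Triton kernel, so the fp8e4nv CompilationError is identical on both paths. A minimal sketch of the dispatch the test performs:

    import torch

    def run(op, *args, compiled: bool):
        if compiled:
            op = torch.compile(op)  # adds the eval_frame hop seen below
        return op(*args)            # same Triton kernel launch either way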
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.0754193Z 2025-05-07T20:31:49.0754312Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.0754536Z 2025-05-07T20:31:49.0754638Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.0755055Z self=, 2025-05-07T20:31:49.0755455Z T=128, 2025-05-07T20:31:49.0755644Z D=7168, 2025-05-07T20:31:49.0755838Z scale_ub=1200.0, 2025-05-07T20:31:49.0756063Z contiguous=True, 2025-05-07T20:31:49.0756286Z compiled=True, 2025-05-07T20:31:49.0756488Z ) 2025-05-07T20:31:49.1213318Z self = 2025-05-07T20:31:49.1213839Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.1214115Z 2025-05-07T20:31:49.1214198Z @given( 2025-05-07T20:31:49.1214433Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.1214741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.1215046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.1215374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.1215705Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.1215993Z ) 2025-05-07T20:31:49.1216337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.1216775Z def test_silu_mul_quant( 2025-05-07T20:31:49.1217013Z self, 2025-05-07T20:31:49.1217208Z T: int, 2025-05-07T20:31:49.1217410Z D: int, 2025-05-07T20:31:49.1217622Z scale_ub: Optional[float], 2025-05-07T20:31:49.1217894Z contiguous: bool, 2025-05-07T20:31:49.1218133Z compiled: bool, 2025-05-07T20:31:49.1218350Z ) -> None: 2025-05-07T20:31:49.1218565Z torch.manual_seed(2025) 2025-05-07T20:31:49.1218812Z 2025-05-07T20:31:49.1219076Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.1219414Z 2025-05-07T20:31:49.1219605Z x_sign = torch.sign(x) 2025-05-07T20:31:49.1219889Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.1220207Z x = x_sign * x_clamp 2025-05-07T20:31:49.1220447Z x0 = x[:, :D] 2025-05-07T20:31:49.1220658Z x1 = x[:, D:] 2025-05-07T20:31:49.1220865Z 2025-05-07T20:31:49.1221051Z if contiguous: 2025-05-07T20:31:49.1221278Z x0 = x0.contiguous() 2025-05-07T20:31:49.1221529Z x1 = x1.contiguous() 2025-05-07T20:31:49.1221770Z 2025-05-07T20:31:49.1221963Z if scale_ub is not None: 2025-05-07T20:31:49.1222228Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.1222561Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.1222870Z ) 2025-05-07T20:31:49.1223200Z else: 2025-05-07T20:31:49.1223413Z scale_ub_tensor = None 2025-05-07T20:31:49.1223662Z 2025-05-07T20:31:49.1223890Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.1224203Z op = silu_mul_quant 2025-05-07T20:31:49.1224450Z if compiled: 2025-05-07T20:31:49.1224836Z op = torch.compile(op) 2025-05-07T20:31:49.1225126Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.1225396Z 2025-05-07T20:31:49.1225587Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.1225751Z 2025-05-07T20:31:49.1225851Z moe/activation_test.py:117: 2025-05-07T20:31:49.1226148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.1226477Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.1226756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.1227306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.1227866Z return fn(*args, **kwargs) 2025-05-07T20:31:49.1228522Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.1229206Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.1229741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.1230493Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.1231148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.1231673Z kernel = self.compile( 2025-05-07T20:31:49.1232205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.1232855Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.1233299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.1233532Z 2025-05-07T20:31:49.1233741Z self = 2025-05-07T20:31:49.1234824Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.1236205Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa5b1e665e0>} 2025-05-07T20:31:49.1237551Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.1238564Z context = 2025-05-07T20:31:49.1238859Z 2025-05-07T20:31:49.1239026Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.1239548Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.1240013Z module_map=module_map) 2025-05-07T20:31:49.1240376Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.1240727Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.1240985Z E ^ 2025-05-07T20:31:49.1241441Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.1241894Z 2025-05-07T20:31:49.1242307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.1242819Z 2025-05-07T20:31:49.1242920Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.1243420Z self=, 2025-05-07T20:31:49.1243818Z T=128, 2025-05-07T20:31:49.1244006Z D=7168, 2025-05-07T20:31:49.1244201Z scale_ub=1200.0, 2025-05-07T20:31:49.1244426Z contiguous=True, 2025-05-07T20:31:49.1244646Z compiled=False, 2025-05-07T20:31:49.1244850Z ) 2025-05-07T20:31:49.1245162Z self = 2025-05-07T20:31:49.1245731Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.1246004Z 2025-05-07T20:31:49.1246080Z @given( 2025-05-07T20:31:49.1246309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.1246613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.1246918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.1247247Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.1247571Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.1247857Z ) 2025-05-07T20:31:49.1248214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.1248649Z def test_silu_mul_quant( 2025-05-07T20:31:49.1248890Z self, 2025-05-07T20:31:49.1249081Z T: int, 2025-05-07T20:31:49.1249272Z D: int, 2025-05-07T20:31:49.1249491Z scale_ub: Optional[float], 2025-05-07T20:31:49.1249771Z contiguous: bool, 2025-05-07T20:31:49.1250007Z compiled: bool, 2025-05-07T20:31:49.1250225Z ) -> None: 2025-05-07T20:31:49.1250441Z torch.manual_seed(2025) 2025-05-07T20:31:49.1250683Z 2025-05-07T20:31:49.1250945Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.1251287Z 2025-05-07T20:31:49.1251480Z x_sign = torch.sign(x) 2025-05-07T20:31:49.1251769Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.1253824Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
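Note: the allocator hint printed in the OutOfMemoryError above can be applied before CUDA is initialized. A minimal sketch of doing that from a Python entry point (the placement in a launcher script is an assumption, not something this log prescribes):

    import os
    # Must be set before torch initializes the CUDA context to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so the caching allocator picks it up

Equivalently, the variable can be exported in the shell before invoking pytest.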
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.1255684Z 2025-05-07T20:31:49.1255803Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.1256026Z 2025-05-07T20:31:49.1256127Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.1256538Z self=, 2025-05-07T20:31:49.1256942Z T=128, 2025-05-07T20:31:49.1257124Z D=5120, 2025-05-07T20:31:49.1257316Z scale_ub=1200.0, 2025-05-07T20:31:49.1257540Z contiguous=True, 2025-05-07T20:31:49.1257760Z compiled=True, 2025-05-07T20:31:49.1264609Z ) 2025-05-07T20:31:49.1264957Z self = 2025-05-07T20:31:49.1265448Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.1265719Z 2025-05-07T20:31:49.1265800Z @given( 2025-05-07T20:31:49.1266030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.1266341Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.1266644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.1266974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.1267299Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.1267585Z ) 2025-05-07T20:31:49.1267929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.1268364Z def test_silu_mul_quant( 2025-05-07T20:31:49.1268605Z self, 2025-05-07T20:31:49.1268801Z T: int, 2025-05-07T20:31:49.1268989Z D: int, 2025-05-07T20:31:49.1269317Z scale_ub: Optional[float], 2025-05-07T20:31:49.1269592Z contiguous: bool, 2025-05-07T20:31:49.1269872Z compiled: bool, 2025-05-07T20:31:49.1270112Z ) -> None: 2025-05-07T20:31:49.1270337Z torch.manual_seed(2025) 2025-05-07T20:31:49.1270597Z 2025-05-07T20:31:49.1270884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.1271355Z 2025-05-07T20:31:49.1271557Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.1274042Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.1276437Z 2025-05-07T20:31:49.1276556Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.1276774Z 2025-05-07T20:31:49.1276877Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.1277286Z self=, 2025-05-07T20:31:49.1277693Z T=128, 2025-05-07T20:31:49.1277871Z D=7168, 2025-05-07T20:31:49.1278060Z scale_ub=None, 2025-05-07T20:31:49.1278274Z contiguous=True, 2025-05-07T20:31:49.1278490Z compiled=True, 2025-05-07T20:31:49.1278688Z ) 2025-05-07T20:31:49.4113787Z self = 2025-05-07T20:31:49.4114299Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.4114559Z 2025-05-07T20:31:49.4114636Z @given( 2025-05-07T20:31:49.4114862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4115181Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4115479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4115802Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4116121Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4116396Z ) 2025-05-07T20:31:49.4116747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4117181Z def test_silu_mul_quant( 2025-05-07T20:31:49.4117408Z self, 2025-05-07T20:31:49.4117595Z T: int, 2025-05-07T20:31:49.4117791Z D: int, 2025-05-07T20:31:49.4117998Z scale_ub: Optional[float], 2025-05-07T20:31:49.4118258Z contiguous: bool, 2025-05-07T20:31:49.4118489Z compiled: bool, 2025-05-07T20:31:49.4118710Z ) -> None: 2025-05-07T20:31:49.4118914Z torch.manual_seed(2025) 2025-05-07T20:31:49.4119147Z 2025-05-07T20:31:49.4119404Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4121458Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.4123310Z 2025-05-07T20:31:49.4123425Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.4123640Z 2025-05-07T20:31:49.4185252Z FAILED 2025-05-07T20:31:49.4185555Z 2025-05-07T20:31:49.4185795Z =================================== FAILURES =================================== 2025-05-07T20:31:49.4186423Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:49.4187257Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:49.4188103Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:31:49.4188842Z | yield 2025-05-07T20:31:49.4189414Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:31:49.4190333Z | self._callTestMethod(testMethod) 2025-05-07T20:31:49.4191094Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:31:49.4191819Z | method() 2025-05-07T20:31:49.4192670Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:49.4194011Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4194927Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:49.4195772Z | raise the_error_hypothesis_found 2025-05-07T20:31:49.4196441Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:49.4197100Z +-+---------------- 1 ---------------- 2025-05-07T20:31:49.4197502Z | Traceback (most recent call last): 2025-05-07T20:31:49.4198473Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:49.4199522Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4202334Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
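Note: Hypothesis runs all of its examples inside a single test invocation, so allocations from earlier examples (here up to T=16384 x 2*7168 bf16, hundreds of MiB per tensor) can still occupy the pool when later examples start, which is consistent with the 20 MiB allocations failing above. A hedged sketch of per-example cleanup; because setUp/tearDown only wrap the whole Hypothesis run, the helper would have to be called at the top of the test body itself:

    import gc
    import torch

    def _reset_cuda_pool() -> None:
        # Call first thing inside test_silu_mul_quant so it runs once per
        # generated example, not once per unittest method.
        gc.collect()  # drop Python references left over from the last example
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver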
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.4205055Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.4205501Z | self=, 2025-05-07T20:31:49.4205897Z | T=128, 2025-05-07T20:31:49.4206092Z | D=7168, 2025-05-07T20:31:49.4206298Z | scale_ub=1200.0, 2025-05-07T20:31:49.4206529Z | contiguous=True, 2025-05-07T20:31:49.4206766Z | compiled=False, 2025-05-07T20:31:49.4206989Z | ) 2025-05-07T20:31:49.4207159Z | 2025-05-07T20:31:49.4207678Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:31:49.4208287Z +---------------- 2 ---------------- 2025-05-07T20:31:49.4208570Z | Traceback (most recent call last): 2025-05-07T20:31:49.4209268Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:49.4210042Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4212099Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.4214266Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.4214703Z | self=, 2025-05-07T20:31:49.4215096Z | T=128, 2025-05-07T20:31:49.4215293Z | D=7168, 2025-05-07T20:31:49.4215499Z | scale_ub=None, 2025-05-07T20:31:49.4215723Z | contiguous=True, 2025-05-07T20:31:49.4216077Z | compiled=True, 2025-05-07T20:31:49.4216293Z | ) 2025-05-07T20:31:49.4216459Z | 2025-05-07T20:31:49.4216977Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:49.4217570Z +---------------- 3 ---------------- 2025-05-07T20:31:49.4217844Z | Traceback (most recent call last): 2025-05-07T20:31:49.4218545Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:49.4219318Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4221356Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.4223789Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.4224399Z | self=, 2025-05-07T20:31:49.4224969Z | T=128, 2025-05-07T20:31:49.4225247Z | D=5120, 2025-05-07T20:31:49.4225536Z | scale_ub=1200.0, 2025-05-07T20:31:49.4225861Z | contiguous=True, 2025-05-07T20:31:49.4226176Z | compiled=True, 2025-05-07T20:31:49.4226495Z | ) 2025-05-07T20:31:49.4226722Z | 2025-05-07T20:31:49.4227441Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:31:49.4228266Z +---------------- 4 ---------------- 2025-05-07T20:31:49.4228654Z | Traceback (most recent call last): 2025-05-07T20:31:49.4229621Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:31:49.4230704Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.4231596Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:31:49.4232527Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4233726Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:31:49.4234814Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.4235627Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:31:49.4236612Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4237663Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:31:49.4238738Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4239837Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:31:49.4241031Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4242079Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:31:49.4242772Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.4243536Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:31:49.4244080Z | fn() 2025-05-07T20:31:49.4244640Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:31:49.4245262Z | self.fn.run( 2025-05-07T20:31:49.4245776Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:31:49.4246346Z | kernel = self.compile( 2025-05-07T20:31:49.4246948Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:31:49.4247642Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4248332Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:49.4249112Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4249617Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4249964Z | def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.4250211Z | ^ 2025-05-07T20:31:49.4250661Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4251212Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.4251601Z | # The test always failed when commented parts were varied together. 2025-05-07T20:31:49.4252110Z | self=, 2025-05-07T20:31:49.4252546Z | T=1, # or any other generated value 2025-05-07T20:31:49.4252847Z | D=5120, # or any other generated value 2025-05-07T20:31:49.4253206Z | scale_ub=None, # or any other generated value 2025-05-07T20:31:49.4253577Z | contiguous=True, # or any other generated value 2025-05-07T20:31:49.4254035Z | compiled=True, # or any other generated value 2025-05-07T20:31:49.4254438Z | ) 2025-05-07T20:31:49.4254682Z | 2025-05-07T20:31:49.4255395Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:49.4256220Z +------------------------------------ 2025-05-07T20:31:49.4256720Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:31:49.4257228Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4257801Z self=, 2025-05-07T20:31:49.4258344Z T=1, 2025-05-07T20:31:49.4258597Z D=5120, 2025-05-07T20:31:49.4258864Z scale_ub=None, 2025-05-07T20:31:49.4259157Z contiguous=True, 2025-05-07T20:31:49.4304531Z compiled=True, 2025-05-07T20:31:49.4304885Z ) 2025-05-07T20:31:49.4305331Z self = 2025-05-07T20:31:49.4306012Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.4306368Z 2025-05-07T20:31:49.4306488Z @given( 2025-05-07T20:31:49.4306804Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4307235Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4307651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4308090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4308843Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4309244Z ) 2025-05-07T20:31:49.4309711Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4310448Z def test_silu_mul_quant( 2025-05-07T20:31:49.4310790Z self, 2025-05-07T20:31:49.4311058Z T: int, 2025-05-07T20:31:49.4311493Z D: int, 2025-05-07T20:31:49.4311796Z scale_ub: Optional[float], 2025-05-07T20:31:49.4312165Z contiguous: bool, 2025-05-07T20:31:49.4312494Z compiled: bool, 2025-05-07T20:31:49.4312807Z ) -> None: 2025-05-07T20:31:49.4313136Z torch.manual_seed(2025) 2025-05-07T20:31:49.4313485Z 2025-05-07T20:31:49.4313850Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4314309Z 2025-05-07T20:31:49.4314564Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4314951Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4315372Z x = x_sign * x_clamp 2025-05-07T20:31:49.4315702Z x0 = x[:, :D] 2025-05-07T20:31:49.4315987Z x1 = x[:, D:] 2025-05-07T20:31:49.4316259Z 2025-05-07T20:31:49.4316511Z if contiguous: 2025-05-07T20:31:49.4316828Z x0 = x0.contiguous() 
2025-05-07T20:31:49.4317197Z x1 = x1.contiguous() 2025-05-07T20:31:49.4317530Z 2025-05-07T20:31:49.4317765Z if scale_ub is not None: 2025-05-07T20:31:49.4318104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4318508Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4318885Z ) 2025-05-07T20:31:49.4319150Z else: 2025-05-07T20:31:49.4319441Z scale_ub_tensor = None 2025-05-07T20:31:49.4319779Z 2025-05-07T20:31:49.4320066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4320456Z op = silu_mul_quant 2025-05-07T20:31:49.4320756Z if compiled: 2025-05-07T20:31:49.4321055Z op = torch.compile(op) 2025-05-07T20:31:49.4321430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4321776Z 2025-05-07T20:31:49.4322041Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.4322409Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.4322760Z 2025-05-07T20:31:49.4323047Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4323544Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.4323895Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.4324290Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.4324743Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4325151Z 2025-05-07T20:31:49.4325392Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.4325636Z 2025-05-07T20:31:49.4325757Z moe/activation_test.py:126: 2025-05-07T20:31:49.4326120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4326527Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.4326927Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4327902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.4328850Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.4329514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4330419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4331366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.4332350Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4333401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.4334322Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4335219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.4336007Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.4336830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.4337467Z fn() 2025-05-07T20:31:49.4338086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.4338799Z self.fn.run( 2025-05-07T20:31:49.4339372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4340026Z kernel = self.compile( 2025-05-07T20:31:49.4340689Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4341495Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4341997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4342279Z 2025-05-07T20:31:49.4342536Z self = 2025-05-07T20:31:49.4343947Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4345721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faba5c74820>} 2025-05-07T20:31:49.4347462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4348752Z context = 2025-05-07T20:31:49.4349133Z 2025-05-07T20:31:49.4349334Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4350144Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4350782Z module_map=module_map) 2025-05-07T20:31:49.4351268Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4351736Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.4352098Z E ^ 2025-05-07T20:31:49.4352726Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4353361Z 2025-05-07T20:31:49.4353947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4354635Z 2025-05-07T20:31:49.4354773Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4355333Z self=, 2025-05-07T20:31:49.4355887Z T=2048, 2025-05-07T20:31:49.4356138Z D=5120, 2025-05-07T20:31:49.4356411Z scale_ub=1200.0, 2025-05-07T20:31:49.4356717Z contiguous=True, 2025-05-07T20:31:49.4357018Z compiled=False, 2025-05-07T20:31:49.4357298Z ) 2025-05-07T20:31:49.4357731Z self = 2025-05-07T20:31:49.4358400Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.4358785Z 2025-05-07T20:31:49.4358892Z @given( 2025-05-07T20:31:49.4359218Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4359647Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4360064Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4360710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4361170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4361552Z ) 2025-05-07T20:31:49.4362031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4362639Z def test_silu_mul_quant( 2025-05-07T20:31:49.4363064Z self, 2025-05-07T20:31:49.4363331Z T: int, 2025-05-07T20:31:49.4363605Z D: int, 2025-05-07T20:31:49.4363899Z scale_ub: Optional[float], 2025-05-07T20:31:49.4364270Z contiguous: bool, 2025-05-07T20:31:49.4364595Z compiled: bool, 2025-05-07T20:31:49.4364902Z ) -> None: 2025-05-07T20:31:49.4365194Z torch.manual_seed(2025) 2025-05-07T20:31:49.4365531Z 2025-05-07T20:31:49.4365904Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4366362Z 2025-05-07T20:31:49.4366629Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4367024Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4367422Z x = x_sign * x_clamp 2025-05-07T20:31:49.4367748Z x0 = x[:, :D] 
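Note: every compiled path in this run fails with the same ValueError because the g5 runner's A10G (sm_86) predates native fp8e4nv (e4m3) support in Triton. A capability gate is one way to skip rather than error on such GPUs; a sketch, where the (8, 9) threshold is an assumption about Triton's requirement (Ada/Hopper), not something stated in this log:

    import unittest
    import torch

    def _has_fp8e4nv() -> bool:
        # fp8e4nv codegen generally needs sm_89+; the A10G above is sm_86.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the fp8 tests so unsupported GPUs skip instead of failing.
    fp8_only = unittest.skipIf(not _has_fp8e4nv(), "fp8e4nv unsupported on this GPU")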
2025-05-07T20:31:49.4368044Z x1 = x[:, D:] 2025-05-07T20:31:49.4368324Z 2025-05-07T20:31:49.4368579Z if contiguous: 2025-05-07T20:31:49.4368903Z x0 = x0.contiguous() 2025-05-07T20:31:49.4369261Z x1 = x1.contiguous() 2025-05-07T20:31:49.4369593Z 2025-05-07T20:31:49.4369858Z if scale_ub is not None: 2025-05-07T20:31:49.4370232Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4370712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4371136Z ) 2025-05-07T20:31:49.4371402Z else: 2025-05-07T20:31:49.4371692Z scale_ub_tensor = None 2025-05-07T20:31:49.4372032Z 2025-05-07T20:31:49.4372347Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4393463Z op = silu_mul_quant 2025-05-07T20:31:49.4393900Z if compiled: 2025-05-07T20:31:49.4394260Z op = torch.compile(op) 2025-05-07T20:31:49.4394680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4395067Z 2025-05-07T20:31:49.4395327Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.4395561Z 2025-05-07T20:31:49.4395706Z moe/activation_test.py:117: 2025-05-07T20:31:49.4396122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4396576Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.4396970Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4397930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.4398856Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.4399544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4400425Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4401305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4402011Z kernel = self.compile( 2025-05-07T20:31:49.4402694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4403570Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4404396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4404694Z 2025-05-07T20:31:49.4404951Z self = 2025-05-07T20:31:49.4406345Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4408524Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faba62e1ee0>} 2025-05-07T20:31:49.4410273Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4411808Z context = 2025-05-07T20:31:49.4412232Z 2025-05-07T20:31:49.4412467Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4413232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4413938Z module_map=module_map) 2025-05-07T20:31:49.4414448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4414922Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.4415300Z E ^ 2025-05-07T20:31:49.4415948Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4416595Z 2025-05-07T20:31:49.4417178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4417921Z 2025-05-07T20:31:49.4418064Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4418639Z self=, 2025-05-07T20:31:49.4419188Z T=2048, 2025-05-07T20:31:49.4419454Z D=5120, 2025-05-07T20:31:49.4419726Z scale_ub=1200.0, 2025-05-07T20:31:49.4420030Z contiguous=True, 2025-05-07T20:31:49.4420335Z compiled=True, 2025-05-07T20:31:49.4420620Z ) 2025-05-07T20:31:49.4421049Z self = 2025-05-07T20:31:49.4421729Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.4422103Z 2025-05-07T20:31:49.4422209Z @given( 2025-05-07T20:31:49.4422523Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4422941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4423358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4423816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4424243Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4424625Z ) 2025-05-07T20:31:49.4425092Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4425678Z def test_silu_mul_quant( 2025-05-07T20:31:49.4426001Z self, 2025-05-07T20:31:49.4426262Z T: int, 2025-05-07T20:31:49.4426530Z D: int, 2025-05-07T20:31:49.4426815Z scale_ub: Optional[float], 2025-05-07T20:31:49.4427184Z contiguous: bool, 2025-05-07T20:31:49.4427509Z compiled: bool, 2025-05-07T20:31:49.4427811Z ) -> None: 2025-05-07T20:31:49.4428108Z torch.manual_seed(2025) 2025-05-07T20:31:49.4428470Z 2025-05-07T20:31:49.4428855Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4429361Z 2025-05-07T20:31:49.4429633Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4430165Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4430598Z x = x_sign * x_clamp 2025-05-07T20:31:49.4430940Z x0 = x[:, :D] 2025-05-07T20:31:49.4431246Z x1 = x[:, D:] 2025-05-07T20:31:49.4431533Z 2025-05-07T20:31:49.4431794Z if contiguous: 2025-05-07T20:31:49.4432113Z x0 = x0.contiguous() 2025-05-07T20:31:49.4432462Z x1 = x1.contiguous() 2025-05-07T20:31:49.4432798Z 2025-05-07T20:31:49.4433094Z if scale_ub is not None: 2025-05-07T20:31:49.4433489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4433949Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4434519Z ) 2025-05-07T20:31:49.4434787Z else: 2025-05-07T20:31:49.4435086Z scale_ub_tensor = None 2025-05-07T20:31:49.4435436Z 2025-05-07T20:31:49.4435739Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4436170Z op = silu_mul_quant 2025-05-07T20:31:49.4436606Z if compiled: 2025-05-07T20:31:49.4436946Z op = torch.compile(op) 2025-05-07T20:31:49.4437351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4437718Z 2025-05-07T20:31:49.4437980Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.4438376Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.4438781Z 2025-05-07T20:31:49.4439114Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4439566Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.4439975Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.4440425Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.4440923Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4441360Z 2025-05-07T20:31:49.4441642Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.4441926Z 2025-05-07T20:31:49.4442077Z moe/activation_test.py:126: 2025-05-07T20:31:49.4442508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4442986Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.4443505Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4444610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.4445676Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.4446438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4447400Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4448349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.4449355Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4450414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.4451457Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4452480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.4453357Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.4454212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.4454935Z fn() 2025-05-07T20:31:49.4455647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.4456389Z self.fn.run( 2025-05-07T20:31:49.4457004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4457722Z kernel = self.compile( 2025-05-07T20:31:49.4458476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4459383Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4459940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4460267Z 2025-05-07T20:31:49.4460551Z self = 2025-05-07T20:31:49.4462152Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4464084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7faba62dd5e0>} 2025-05-07T20:31:49.4465861Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4467136Z context = 2025-05-07T20:31:49.4467515Z 2025-05-07T20:31:49.4467729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4468428Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4469035Z module_map=module_map) 2025-05-07T20:31:49.4469540Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4470095Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.4470426Z E ^ 2025-05-07T20:31:49.4471029Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4471640Z 2025-05-07T20:31:49.4472209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4472911Z 2025-05-07T20:31:49.4473054Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4473615Z self=, 2025-05-07T20:31:49.4474156Z T=16384, 2025-05-07T20:31:49.4474425Z D=7168, 2025-05-07T20:31:49.4474692Z scale_ub=1200.0, 2025-05-07T20:31:49.4474996Z contiguous=False, 2025-05-07T20:31:49.4475306Z compiled=False, 2025-05-07T20:31:49.4475590Z ) 2025-05-07T20:31:49.4476021Z self = 2025-05-07T20:31:49.4476698Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.4477083Z 2025-05-07T20:31:49.4477203Z @given( 2025-05-07T20:31:49.4477510Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4477916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4478358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4478789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4479210Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4479599Z ) 2025-05-07T20:31:49.4480034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4480573Z def test_silu_mul_quant( 2025-05-07T20:31:49.4480871Z self, 2025-05-07T20:31:49.4481114Z T: int, 2025-05-07T20:31:49.4481351Z D: int, 2025-05-07T20:31:49.4481624Z scale_ub: Optional[float], 2025-05-07T20:31:49.4481995Z contiguous: bool, 2025-05-07T20:31:49.4482329Z compiled: bool, 2025-05-07T20:31:49.4482649Z ) -> None: 2025-05-07T20:31:49.4482955Z torch.manual_seed(2025) 2025-05-07T20:31:49.4483293Z 2025-05-07T20:31:49.4483680Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4484177Z 2025-05-07T20:31:49.4484456Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4484852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4485284Z x = x_sign * x_clamp 2025-05-07T20:31:49.4485625Z x0 = x[:, :D] 2025-05-07T20:31:49.4485918Z x1 = x[:, D:] 2025-05-07T20:31:49.4486203Z 2025-05-07T20:31:49.4486464Z if contiguous: 2025-05-07T20:31:49.4486766Z x0 = x0.contiguous() 2025-05-07T20:31:49.4487087Z x1 = x1.contiguous() 2025-05-07T20:31:49.4487391Z 2025-05-07T20:31:49.4487623Z if scale_ub is not None: 2025-05-07T20:31:49.4488069Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4488487Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4488861Z ) 2025-05-07T20:31:49.4489133Z else: 2025-05-07T20:31:49.4489425Z scale_ub_tensor = None 2025-05-07T20:31:49.4489760Z 2025-05-07T20:31:49.4490067Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4490613Z op = silu_mul_quant 2025-05-07T20:31:49.4490941Z if compiled: 
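Note: before quantization, the math exercised by the listings in this log reduces to SiLU gating. A sketch of the unfused eager equivalent, using only public torch ops (this mirrors ref_fn above and is not the FBGEMM kernel):

    import torch
    import torch.nn.functional as F

    def silu_mul_eager(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # x0 * sigmoid(x0) * x1, computed in fp32 exactly as ref_fn does.
        return F.silu(x0.float()) * x1.float()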
2025-05-07T20:31:49.4491202Z op = torch.compile(op) 2025-05-07T20:31:49.4491504Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4491783Z 2025-05-07T20:31:49.4491979Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.4492150Z 2025-05-07T20:31:49.4492259Z moe/activation_test.py:117: 2025-05-07T20:31:49.4492558Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4492885Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.4493176Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4493873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.4494578Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.4495116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4495813Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4496480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4497009Z kernel = self.compile( 2025-05-07T20:31:49.4497554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4498212Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4498617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4498848Z 2025-05-07T20:31:49.4499062Z self = 2025-05-07T20:31:49.4500155Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4501555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab6864e160>} 2025-05-07T20:31:49.4502914Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4504310Z context = 2025-05-07T20:31:49.4504610Z 2025-05-07T20:31:49.4504783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4505317Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4505784Z module_map=module_map) 2025-05-07T20:31:49.4506150Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4506509Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.4506768Z E ^ 2025-05-07T20:31:49.4507232Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4507692Z 2025-05-07T20:31:49.4508112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4508630Z 2025-05-07T20:31:49.4508734Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4509148Z self=, 2025-05-07T20:31:49.4509762Z T=1, 2025-05-07T20:31:49.4510040Z D=7168, 2025-05-07T20:31:49.4510228Z scale_ub=None, 2025-05-07T20:31:49.4510435Z contiguous=True, 2025-05-07T20:31:49.4510654Z compiled=True, 2025-05-07T20:31:49.4510850Z ) 2025-05-07T20:31:49.4511163Z self = 2025-05-07T20:31:49.4511773Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.4512033Z 2025-05-07T20:31:49.4512109Z @given( 2025-05-07T20:31:49.4512334Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4512638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4512948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4513293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4513650Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4513937Z ) 2025-05-07T20:31:49.4514291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4514729Z def test_silu_mul_quant( 2025-05-07T20:31:49.4514964Z self, 2025-05-07T20:31:49.4515156Z T: int, 2025-05-07T20:31:49.4515353Z D: int, 2025-05-07T20:31:49.4515564Z scale_ub: Optional[float], 2025-05-07T20:31:49.4515844Z contiguous: bool, 2025-05-07T20:31:49.4516076Z compiled: bool, 2025-05-07T20:31:49.4516292Z ) -> None: 2025-05-07T20:31:49.4516511Z torch.manual_seed(2025) 2025-05-07T20:31:49.4516751Z 2025-05-07T20:31:49.4517014Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4517355Z 2025-05-07T20:31:49.4517547Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4517829Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4518135Z x = x_sign * x_clamp 2025-05-07T20:31:49.4518376Z x0 = x[:, :D] 2025-05-07T20:31:49.4518589Z x1 = x[:, D:] 2025-05-07T20:31:49.4518794Z 2025-05-07T20:31:49.4518982Z if contiguous: 2025-05-07T20:31:49.4519204Z x0 = x0.contiguous() 2025-05-07T20:31:49.4519464Z x1 = x1.contiguous() 2025-05-07T20:31:49.4519708Z 2025-05-07T20:31:49.4519898Z if scale_ub is not None: 2025-05-07T20:31:49.4520165Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4520511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4520814Z ) 2025-05-07T20:31:49.4521002Z else: 2025-05-07T20:31:49.4521213Z scale_ub_tensor = None 2025-05-07T20:31:49.4521464Z 2025-05-07T20:31:49.4521690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4522005Z op = silu_mul_quant 2025-05-07T20:31:49.4522255Z if compiled: 2025-05-07T20:31:49.4522497Z op = torch.compile(op) 2025-05-07T20:31:49.4522792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4523070Z 2025-05-07T20:31:49.4523262Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.4523565Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.4523856Z 2025-05-07T20:31:49.4524092Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4524422Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.4524721Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.4525032Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.4525382Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4525685Z 2025-05-07T20:31:49.4525883Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:49.4526076Z 2025-05-07T20:31:49.4526175Z moe/activation_test.py:126: 2025-05-07T20:31:49.4526470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4526806Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.4527132Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4528001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.4528769Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.4529319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4530071Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4530760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.4531486Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4532238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.4533082Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4533994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.4534790Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.4535502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.4536016Z fn() 2025-05-07T20:31:49.4536515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.4537088Z self.fn.run( 2025-05-07T20:31:49.4537545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4538071Z kernel = self.compile( 2025-05-07T20:31:49.4538608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4539264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4539652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4539883Z 2025-05-07T20:31:49.4540092Z self = 2025-05-07T20:31:49.4541178Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4542576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7faba5c20280>} 2025-05-07T20:31:49.4544148Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4545307Z context = 2025-05-07T20:31:49.4545606Z 2025-05-07T20:31:49.4545775Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4546304Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4546767Z module_map=module_map) 2025-05-07T20:31:49.4547133Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4547491Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.4547756Z E ^ 2025-05-07T20:31:49.4548219Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4548673Z 2025-05-07T20:31:49.4549088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4549597Z 2025-05-07T20:31:49.4549704Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4550270Z self=, 2025-05-07T20:31:49.4550680Z T=4096, 2025-05-07T20:31:49.4550866Z D=5120, 2025-05-07T20:31:49.4551061Z scale_ub=None, 2025-05-07T20:31:49.4551270Z contiguous=False, 2025-05-07T20:31:49.4551492Z compiled=False, 2025-05-07T20:31:49.4551772Z ) 2025-05-07T20:31:49.4552091Z self = 2025-05-07T20:31:49.4552583Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.4552854Z 2025-05-07T20:31:49.4552933Z @given( 2025-05-07T20:31:49.4553152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4553459Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4553761Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4554082Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4554419Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4554703Z ) 2025-05-07T20:31:49.4555048Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4555480Z def test_silu_mul_quant( 2025-05-07T20:31:49.4555718Z self, 2025-05-07T20:31:49.4555908Z T: int, 2025-05-07T20:31:49.4556104Z D: int, 2025-05-07T20:31:49.4556319Z scale_ub: Optional[float], 2025-05-07T20:31:49.4556587Z contiguous: bool, 2025-05-07T20:31:49.4556816Z compiled: bool, 2025-05-07T20:31:49.4557035Z ) -> None: 2025-05-07T20:31:49.4557248Z torch.manual_seed(2025) 2025-05-07T20:31:49.4557479Z 2025-05-07T20:31:49.4557743Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4558080Z 2025-05-07T20:31:49.4558271Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4558558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4558864Z x = x_sign * x_clamp 2025-05-07T20:31:49.4559102Z x0 = x[:, :D] 2025-05-07T20:31:49.4559320Z x1 = x[:, D:] 2025-05-07T20:31:49.4559524Z 2025-05-07T20:31:49.4559709Z if contiguous: 2025-05-07T20:31:49.4559932Z x0 = x0.contiguous() 2025-05-07T20:31:49.4560190Z x1 = x1.contiguous() 2025-05-07T20:31:49.4560430Z 2025-05-07T20:31:49.4560621Z if scale_ub is not None: 2025-05-07T20:31:49.4560900Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4561233Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4561535Z ) 2025-05-07T20:31:49.4561731Z else: 2025-05-07T20:31:49.4561942Z scale_ub_tensor = None 2025-05-07T20:31:49.4562186Z 2025-05-07T20:31:49.4562416Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4562728Z op = silu_mul_quant 2025-05-07T20:31:49.4562983Z if compiled: 
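Note: triton_quantize_fp8_row, the reference path that fails to compile above, performs rowwise scaling into fp8. A rough pure-PyTorch sketch of that idea; the 448.0 e4m3 maximum and the exact scale_ub handling are assumptions, not FBGEMM's verified semantics:

    import torch

    def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor = None):
        # One scale per row: map the row's max |value| onto the fp8 e4m3 range.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the rowwise scale
        scale = row_max.float() / 448.0
        y_fp8 = (y.float() / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)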
2025-05-07T20:31:49.4563266Z                 op = torch.compile(op)
2025-05-07T20:31:49.4563567Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:49.4563833Z 
2025-05-07T20:31:49.4564031Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:49.4564201Z 
2025-05-07T20:31:49.4564297Z moe/activation_test.py:117: 
2025-05-07T20:31:49.4564589Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:49.4564918Z moe/activation_test.py:115: in fn
2025-05-07T20:31:49.4565195Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:49.4565886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:49.4566589Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:49.4567126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:49.4567807Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:49.4568577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:49.4569105Z     kernel = self.compile(
2025-05-07T20:31:49.4569647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:49.4570299Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:49.4570774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:49.4571006Z 
2025-05-07T20:31:49.4571232Z self = 
2025-05-07T20:31:49.4580729Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:49.4582134Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab69195f70>}
2025-05-07T20:31:49.4583482Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:49.4584517Z context = 
2025-05-07T20:31:49.4584816Z 
2025-05-07T20:31:49.4584989Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:49.4585521Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:49.4585985Z                            module_map=module_map)
2025-05-07T20:31:49.4586355Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:49.4586715Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:49.4586986Z E       ^
2025-05-07T20:31:49.4587461Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:49.4587919Z 
2025-05-07T20:31:49.4588334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:49.4588845Z 
2025-05-07T20:31:49.4588957Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
[identical test source and traceback elided; fails with the same CompilationError in _fbgemm_silu_mul_quant: fp8e4nv not supported]
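The root cause of every failure in this run is the same: Triton's fp8e4nv type is the NVIDIA-native float8 e4m3 format (torch.float8_e4m3fn), which the backend only lowers on GPUs with compute capability 8.9 (Ada) or newer, while the linux.g5.4xlarge runner carries an A10G at sm_86, where only fp8e4b15 and fp8e5 are available. A minimal sketch of a capability gate that would skip these examples on such runners follows; the helper name and decorator placement are hypothetical, not FBGEMM's actual code:

    import unittest

    import torch

    def supports_native_fp8_e4m3() -> bool:
        # fp8e4nv lowers to native e4m3 instructions only on sm_89+ (Ada/Hopper).
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Hypothetical usage on the test above:
    # @unittest.skipIf(not supports_native_fp8_e4m3(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(self, ...): ...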
2025-05-07T20:31:49.4620370Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
[test source up to fn() identical to the example above; in this draw fn() returns and the reference path fails instead:]
2025-05-07T20:31:49.4634605Z         y_fp8, y_scale = fn()
2025-05-07T20:31:49.4634888Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:49.4635180Z 
2025-05-07T20:31:49.4635420Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:49.4635838Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:49.4636130Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:49.4636444Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:49.4636795Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:49.4637182Z 
2025-05-07T20:31:49.4637385Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:49.4637576Z 
2025-05-07T20:31:49.4637673Z moe/activation_test.py:126: 
2025-05-07T20:31:49.4637967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:49.4638295Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:49.4638621Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:49.4639400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:49.4640158Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:49.4640705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:49.4641379Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:49.4642064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:49.4642786Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:49.4643534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:49.4644273Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:49.4644997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:49.4645636Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:49.4646237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:49.4646746Z     fn()
2025-05-07T20:31:49.4647243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:49.4647829Z     self.fn.run(
2025-05-07T20:31:49.4648285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:49.4648810Z     kernel = self.compile(
2025-05-07T20:31:49.4649342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:49.4649990Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:49.4650379Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:49.4650612Z 
2025-05-07T20:31:49.4650822Z self = 
2025-05-07T20:31:49.4651906Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:49.4653503Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faba1a29700>}
2025-05-07T20:31:49.4655172Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:49.4656217Z context = 
2025-05-07T20:31:49.4656512Z 
2025-05-07T20:31:49.4656681Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:49.4657321Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:49.4657785Z                            module_map=module_map)
2025-05-07T20:31:49.4658154Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:49.4658511Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:49.4658854Z E       ^
2025-05-07T20:31:49.4659312Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:49.4659766Z 
2025-05-07T20:31:49.4660178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:49.4660684Z 
2025-05-07T20:31:49.4660794Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
[identical test source and traceback elided; fails with the same CompilationError in _fbgemm_silu_mul_quant: fp8e4nv not supported]
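For readers following the reference path above: triton_quantize_fp8_row computes one scale per row, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A rough pure-PyTorch sketch of that contract, inferred from the test rather than taken from FBGEMM (the 448.0 constant is the float8_e4m3fn format maximum; treating scale_ub as a cap on the row maximum is an assumption):

    from typing import Optional, Tuple

    import torch  # float8_e4m3fn requires PyTorch >= 2.1

    FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row dequantization scale: row_max / fp8_max, optionally capped,
        # floored to avoid division by zero on all-zero rows.
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = (row_max / FP8_E4M3_MAX).clamp(min=1e-12)
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale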
2025-05-07T20:31:49.4691662Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[identical test source and traceback elided; fails with the same CompilationError in _fbgemm_silu_mul_quant: fp8e4nv not supported]
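The repetition in this log comes from the test's Hypothesis settings: verbosity=Verbosity.verbose logs every drawn parameter combination as "Trying example: ...", and each failing draw re-prints the test source and traceback. A standalone illustration of the same mechanism (check_grid and its body are made up for this example):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def check_grid(T: int, D: int) -> None:
        # With verbose verbosity, each drawn (T, D) pair is printed as
        # "Trying example: check_grid(T=..., D=...)" before the body runs.
        assert T >= 1 and D in (5120, 7168)

    check_grid()  # invoking the decorated function runs all drawn examples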
2025-05-07T20:31:49.4722498Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical test source and traceback elided; the reference path fails with the same CompilationError in _kernel_quantize_fp8_row: fp8e4nv not supported]
2025-05-07T20:31:49.4751167Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical test source and traceback elided; the reference path fails with the same CompilationError in _kernel_quantize_fp8_row: fp8e4nv not supported]
2025-05-07T20:31:49.4767419Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical test source and traceback elided; the reference path fails with the same CompilationError in _kernel_quantize_fp8_row: fp8e4nv not supported]
2025-05-07T20:31:49.4786482Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical test source and traceback elided; the reference path fails with the same CompilationError in _kernel_quantize_fp8_row: fp8e4nv not supported]
2025-05-07T20:31:49.4802779Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical test source and traceback elided; the reference path fails with the same CompilationError in _kernel_quantize_fp8_row: fp8e4nv not supported]
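Since the error message itself names the workable alternatives ('fp8e4b15', 'fp8e5'), one pragmatic response on pre-sm_89 hardware is to fall back to the e5m2 format, which fp8e5 corresponds to on the PyTorch side. A hedged sketch (pick_fp8_dtype is hypothetical; whether e5m2's smaller mantissa is acceptable depends on the kernel's accuracy budget):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # e4m3 (fp8e4nv) needs native support on sm_89+; e5m2 (fp8e5) is the
        # alternative Triton reports as supported on this A10G (sm_86) runner.
        major, minor = torch.cuda.get_device_capability()
        return torch.float8_e4m3fn if (major, minor) >= (8, 9) else torch.float8_e5m2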
2025-05-07T20:31:49.4815179Z 2025-05-07T20:31:49.4815383Z self = 2025-05-07T20:31:49.4816157Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4816817Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa049af790>} 2025-05-07T20:31:49.4817560Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4817752Z context = 2025-05-07T20:31:49.4817761Z 2025-05-07T20:31:49.4817924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4818187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4818292Z module_map=module_map) 2025-05-07T20:31:49.4818457Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4818562Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.4818637Z E ^ 2025-05-07T20:31:49.4818995Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4819002Z 2025-05-07T20:31:49.4819411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4819415Z 2025-05-07T20:31:49.4819516Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4819743Z self=, 2025-05-07T20:31:49.4819814Z T=1, 2025-05-07T20:31:49.4819886Z D=5120, 2025-05-07T20:31:49.4819970Z scale_ub=1200.0, 2025-05-07T20:31:49.4820051Z contiguous=True, 2025-05-07T20:31:49.4820132Z compiled=True, 2025-05-07T20:31:49.4820205Z ) 2025-05-07T20:31:49.4820416Z self = 2025-05-07T20:31:49.4820588Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.4820592Z 2025-05-07T20:31:49.4820664Z @given( 2025-05-07T20:31:49.4820781Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4820882Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4820991Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4821102Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4821219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4821289Z ) 2025-05-07T20:31:49.4821536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4821630Z def test_silu_mul_quant( 2025-05-07T20:31:49.4821702Z self, 2025-05-07T20:31:49.4821779Z T: int, 2025-05-07T20:31:49.4821852Z D: int, 2025-05-07T20:31:49.4821944Z scale_ub: Optional[float], 2025-05-07T20:31:49.4822035Z contiguous: bool, 2025-05-07T20:31:49.4822114Z compiled: bool, 2025-05-07T20:31:49.4822187Z ) -> None: 2025-05-07T20:31:49.4822280Z torch.manual_seed(2025) 2025-05-07T20:31:49.4822348Z 2025-05-07T20:31:49.4822511Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4822588Z 2025-05-07T20:31:49.4822674Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4822792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4822877Z x = x_sign * x_clamp 2025-05-07T20:31:49.4822951Z x0 = x[:, :D] 2025-05-07T20:31:49.4823027Z x1 = x[:, D:] 2025-05-07T20:31:49.4823178Z 2025-05-07T20:31:49.4823260Z if contiguous: 2025-05-07T20:31:49.4823349Z x0 = x0.contiguous() 2025-05-07T20:31:49.4823434Z x1 = x1.contiguous() 2025-05-07T20:31:49.4823503Z 2025-05-07T20:31:49.4823596Z if scale_ub is not None: 2025-05-07T20:31:49.4823792Z scale_ub_tensor = 
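Every failure in this run has the same root cause: the job runs on a g5.4xlarge runner, whose NVIDIA A10G GPU is an sm_86 (Ampere) part, while Triton only lowers the fp8e4nv (FP8 E4M3) element type on sm_89-class and newer GPUs; on sm_86 only fp8e4b15 and fp8e5 are available, exactly as the repeated ValueError states. A minimal sketch of a capability guard that would skip these cases on unsupported hardware (hypothetical names; no such guard appears in the test suite shown in this log):

    # Hypothetical guard, not part of moe/activation_test.py.
    # fp8e4nv maps to FP8 E4M3, which Triton supports only on sm_89+;
    # the A10G on this runner reports compute capability (8, 6).
    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # (major, minor) compute capability of the current device.
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class SiluMulQuantFP8Tests(unittest.TestCase):
        ...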
Trying examples: each of the following fails identically, with
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
raised from triton/compiler/compiler.py:100 while compiling the kernel named below:

    T=1,   D=5120, scale_ub=1200.0, contiguous=True,  compiled=True   -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=1,   D=5120, scale_ub=None,   contiguous=False, compiled=True   -> ref_fn() (moe/activation_test.py:126) compiling _kernel_quantize_fp8_row
    T=1,   D=5120, scale_ub=None,   contiguous=True,  compiled=False  -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=128, D=5120, scale_ub=None,   contiguous=False, compiled=True   -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False  -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False  -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False  -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True   -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=1,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True   -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=1,   D=7168, scale_ub=None,   contiguous=False, compiled=True   -> ref_fn() (moe/activation_test.py:126) compiling _kernel_quantize_fp8_row
2025-05-07T20:31:49.4965811Z op = torch.compile(op) 2025-05-07T20:31:49.4965913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4965985Z 2025-05-07T20:31:49.4966071Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.4966075Z 2025-05-07T20:31:49.4966168Z moe/activation_test.py:117: 2025-05-07T20:31:49.4966293Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4966391Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.4966486Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4966861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.4966948Z return fn(*args, **kwargs) 2025-05-07T20:31:49.4967441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.4967537Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.4967889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4968112Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4968447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4968536Z kernel = self.compile( 2025-05-07T20:31:49.4968914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4969090Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4969216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4969220Z 2025-05-07T20:31:49.4969424Z self = 2025-05-07T20:31:49.4970204Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4970710Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa03105e50>} 2025-05-07T20:31:49.4971460Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4971653Z context = 2025-05-07T20:31:49.4971658Z 2025-05-07T20:31:49.4971819Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4972083Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4972190Z module_map=module_map) 2025-05-07T20:31:49.4972349Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4972445Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.4972518Z E ^ 2025-05-07T20:31:49.4972870Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4972875Z 2025-05-07T20:31:49.4973285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4973290Z 2025-05-07T20:31:49.4973467Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4973693Z self=, 2025-05-07T20:31:49.4973768Z T=1, 2025-05-07T20:31:49.4973841Z D=5120, 2025-05-07T20:31:49.4973923Z scale_ub=1200.0, 2025-05-07T20:31:49.4974132Z contiguous=False, 2025-05-07T20:31:49.4974210Z compiled=False, 2025-05-07T20:31:49.4974284Z ) 2025-05-07T20:31:49.4974498Z self = 2025-05-07T20:31:49.4974660Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.4974665Z 2025-05-07T20:31:49.4974742Z @given( 2025-05-07T20:31:49.4974856Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4974952Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4975062Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4975172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4975289Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4975360Z ) 2025-05-07T20:31:49.4975603Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4975696Z def test_silu_mul_quant( 2025-05-07T20:31:49.4975769Z self, 2025-05-07T20:31:49.4975844Z T: int, 2025-05-07T20:31:49.4975921Z D: int, 2025-05-07T20:31:49.4976018Z scale_ub: Optional[float], 2025-05-07T20:31:49.4976103Z contiguous: bool, 2025-05-07T20:31:49.4976181Z compiled: bool, 2025-05-07T20:31:49.4976253Z ) -> None: 2025-05-07T20:31:49.4976346Z torch.manual_seed(2025) 2025-05-07T20:31:49.4976415Z 2025-05-07T20:31:49.4976577Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4976649Z 2025-05-07T20:31:49.4976734Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4976852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4976942Z x = x_sign * x_clamp 2025-05-07T20:31:49.4977020Z x0 = x[:, :D] 2025-05-07T20:31:49.4977093Z x1 = x[:, D:] 2025-05-07T20:31:49.4977165Z 2025-05-07T20:31:49.4977244Z if contiguous: 2025-05-07T20:31:49.4977332Z x0 = x0.contiguous() 2025-05-07T20:31:49.4977418Z x1 = x1.contiguous() 2025-05-07T20:31:49.4977490Z 2025-05-07T20:31:49.4977580Z if scale_ub is not None: 2025-05-07T20:31:49.4977678Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4977808Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4977884Z ) 2025-05-07T20:31:49.4977958Z else: 2025-05-07T20:31:49.4978052Z scale_ub_tensor = None 2025-05-07T20:31:49.4978123Z 2025-05-07T20:31:49.4978246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4978331Z op = silu_mul_quant 2025-05-07T20:31:49.4978414Z if compiled: 2025-05-07T20:31:49.4978516Z op = torch.compile(op) 2025-05-07T20:31:49.4978616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4978691Z 2025-05-07T20:31:49.4978780Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.4978784Z 2025-05-07T20:31:49.4978877Z moe/activation_test.py:117: 2025-05-07T20:31:49.4979005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4979110Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.4979209Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4979709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.4979814Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.4980172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4980401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4985365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4985474Z kernel = self.compile( 2025-05-07T20:31:49.4985870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4986143Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4986271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4986276Z 2025-05-07T20:31:49.4986479Z self = 2025-05-07T20:31:49.4987255Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4987770Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02f22820>} 2025-05-07T20:31:49.4988511Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4988710Z context = 2025-05-07T20:31:49.4988715Z 2025-05-07T20:31:49.4988875Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4989135Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4989241Z module_map=module_map) 2025-05-07T20:31:49.4989401Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4989501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.4989574Z E ^ 2025-05-07T20:31:49.4989991Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4989997Z 2025-05-07T20:31:49.4990410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4990420Z 2025-05-07T20:31:49.4990518Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4990742Z self=, 2025-05-07T20:31:49.4990816Z T=16384, 2025-05-07T20:31:49.4990889Z D=5120, 2025-05-07T20:31:49.4990969Z scale_ub=1200.0, 2025-05-07T20:31:49.4991049Z contiguous=False, 2025-05-07T20:31:49.4991124Z compiled=True, 2025-05-07T20:31:49.4991195Z ) 2025-05-07T20:31:49.4991410Z self = 2025-05-07T20:31:49.4991585Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.4991589Z 2025-05-07T20:31:49.4991672Z @given( 2025-05-07T20:31:49.4991788Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4991886Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4991999Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4992111Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4992225Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4992295Z ) 2025-05-07T20:31:49.4992538Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4992628Z def test_silu_mul_quant( 2025-05-07T20:31:49.4992702Z self, 2025-05-07T20:31:49.4992771Z T: int, 2025-05-07T20:31:49.4992846Z D: int, 2025-05-07T20:31:49.4992941Z scale_ub: Optional[float], 2025-05-07T20:31:49.4993022Z contiguous: bool, 2025-05-07T20:31:49.4993107Z compiled: bool, 2025-05-07T20:31:49.4993183Z ) -> None: 2025-05-07T20:31:49.4993363Z torch.manual_seed(2025) 2025-05-07T20:31:49.4993456Z 2025-05-07T20:31:49.4993645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4993716Z 2025-05-07T20:31:49.4993803Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4993924Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4994088Z x = x_sign * x_clamp 2025-05-07T20:31:49.4994162Z x0 = x[:, :D] 2025-05-07T20:31:49.4994236Z x1 = x[:, D:] 2025-05-07T20:31:49.4994307Z 2025-05-07T20:31:49.4994387Z if contiguous: 2025-05-07T20:31:49.4994475Z x0 = x0.contiguous() 2025-05-07T20:31:49.4994561Z x1 = x1.contiguous() 2025-05-07T20:31:49.4994632Z 2025-05-07T20:31:49.4994719Z if scale_ub is not None: 2025-05-07T20:31:49.4994823Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4994955Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4995032Z ) 2025-05-07T20:31:49.4995108Z else: 2025-05-07T20:31:49.4995197Z scale_ub_tensor = None 2025-05-07T20:31:49.4995271Z 2025-05-07T20:31:49.4995395Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4995482Z op = silu_mul_quant 2025-05-07T20:31:49.4995566Z if compiled: 2025-05-07T20:31:49.4995666Z op = torch.compile(op) 2025-05-07T20:31:49.4995769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4995840Z 2025-05-07T20:31:49.4995926Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.4995931Z 2025-05-07T20:31:49.4996024Z moe/activation_test.py:117: 2025-05-07T20:31:49.4996152Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4996250Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.4996349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4996716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.4996809Z return fn(*args, **kwargs) 
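Every failure in this run is the same CompilationError: Triton refuses to lower the fp8e4nv dtype (PyTorch's torch.float8_e4m3fn) on this GPU. fp8e4nv conversion support in Triton requires compute capability 8.9 or newer (Ada/Hopper), while the linux.g5.4xlarge runner carries an A10G, which reports capability (8, 6) — hence only fp8e4b15 and fp8e5 are offered, exactly as the error message says. A minimal sketch of a capability guard such a test suite could use (supports_fp8e4nv and requires_fp8 are hypothetical names, not FBGEMM APIs):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) lowering needs SM 8.9+;
        # the A10G on a g5 instance reports (8, 6), so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests like test_silu_mul_quant:
    requires_fp8 = unittest.skipUnless(
        supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
    )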
2025-05-07T20:31:49.4997307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.4997400Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.4997754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4997984Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4998316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4998409Z kernel = self.compile( 2025-05-07T20:31:49.4998783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4998953Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4999081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4999085Z 2025-05-07T20:31:49.4999289Z self = 2025-05-07T20:31:49.5000068Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5000579Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa0295d790>} 2025-05-07T20:31:49.5001328Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5001517Z context = 2025-05-07T20:31:49.5001604Z 2025-05-07T20:31:49.5001769Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5002033Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5002137Z module_map=module_map) 2025-05-07T20:31:49.5002373Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5002469Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5002545Z E ^ 2025-05-07T20:31:49.5002902Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5002907Z 2025-05-07T20:31:49.5003315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5003320Z 2025-05-07T20:31:49.5003417Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5003644Z self=, 2025-05-07T20:31:49.5003925Z T=2048, 2025-05-07T20:31:49.5004042Z D=7168, 2025-05-07T20:31:49.5004154Z scale_ub=1200.0, 2025-05-07T20:31:49.5004270Z contiguous=False, 2025-05-07T20:31:49.5004390Z compiled=True, 2025-05-07T20:31:49.5004486Z ) 2025-05-07T20:31:49.5004771Z self = 2025-05-07T20:31:49.5004956Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.5004961Z 2025-05-07T20:31:49.5005035Z @given( 2025-05-07T20:31:49.5005150Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5005246Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5005355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5005471Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5005580Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5005650Z ) 2025-05-07T20:31:49.5005900Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5005987Z def test_silu_mul_quant( 2025-05-07T20:31:49.5006060Z self, 2025-05-07T20:31:49.5006137Z T: int, 2025-05-07T20:31:49.5006212Z D: int, 2025-05-07T20:31:49.5006309Z scale_ub: Optional[float], 2025-05-07T20:31:49.5006397Z contiguous: bool, 2025-05-07T20:31:49.5006477Z compiled: bool, 2025-05-07T20:31:49.5006555Z ) -> None: 2025-05-07T20:31:49.5006645Z torch.manual_seed(2025) 2025-05-07T20:31:49.5006714Z 2025-05-07T20:31:49.5006884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5006954Z 2025-05-07T20:31:49.5007039Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5007163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5007247Z x = x_sign * x_clamp 2025-05-07T20:31:49.5007322Z x0 = x[:, :D] 2025-05-07T20:31:49.5007404Z x1 = x[:, D:] 2025-05-07T20:31:49.5007476Z 2025-05-07T20:31:49.5007554Z if contiguous: 2025-05-07T20:31:49.5007643Z x0 = x0.contiguous() 2025-05-07T20:31:49.5007731Z x1 = x1.contiguous() 2025-05-07T20:31:49.5007800Z 2025-05-07T20:31:49.5007889Z if scale_ub is not None: 2025-05-07T20:31:49.5007993Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5008126Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5008198Z ) 2025-05-07T20:31:49.5008270Z else: 2025-05-07T20:31:49.5008364Z scale_ub_tensor = None 2025-05-07T20:31:49.5008432Z 2025-05-07T20:31:49.5008556Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5008645Z op = silu_mul_quant 2025-05-07T20:31:49.5008726Z if compiled: 2025-05-07T20:31:49.5008819Z op = torch.compile(op) 2025-05-07T20:31:49.5008926Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5009151Z 2025-05-07T20:31:49.5009243Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5009248Z 2025-05-07T20:31:49.5009342Z moe/activation_test.py:117: 2025-05-07T20:31:49.5009466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5009565Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5009771Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5010142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5010231Z return fn(*args, **kwargs) 
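For context on what the failing test checks: silu_mul_quant fuses a SiLU-gated multiply, y = x0 * sigmoid(x0) * x1, with rowwise FP8 quantization, which is what the test's ref_fn reproduces via triton_quantize_fp8_row. A pure-PyTorch sketch of that reference, under the assumption that rowwise quantization picks one scale per row so the row maximum maps to the float8_e4m3fn limit of 448 (the exact clamping inside triton_quantize_fp8_row may differ):

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        fp8_max: float = 448.0,  # float8_e4m3fn max finite value
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, matching the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # One scale per row so that max(|row|) maps onto fp8_max.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale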
2025-05-07T20:31:49.5010720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5010815Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5011166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5011397Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5011733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5011822Z kernel = self.compile( 2025-05-07T20:31:49.5012197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5012377Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5012499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5012504Z 2025-05-07T20:31:49.5012709Z self = 2025-05-07T20:31:49.5013533Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5014044Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02a404c0>} 2025-05-07T20:31:49.5014781Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5014974Z context = 2025-05-07T20:31:49.5014979Z 2025-05-07T20:31:49.5015144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5015402Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5015505Z module_map=module_map) 2025-05-07T20:31:49.5015663Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5015757Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5015838Z E ^ 2025-05-07T20:31:49.5016188Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5016193Z 2025-05-07T20:31:49.5016602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5016615Z 2025-05-07T20:31:49.5016714Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5016931Z self=, 2025-05-07T20:31:49.5017007Z T=1, 2025-05-07T20:31:49.5017079Z D=5120, 2025-05-07T20:31:49.5017154Z scale_ub=None, 2025-05-07T20:31:49.5017240Z contiguous=False, 2025-05-07T20:31:49.5017319Z compiled=False, 2025-05-07T20:31:49.5017389Z ) 2025-05-07T20:31:49.5017603Z self = 2025-05-07T20:31:49.5017875Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.5017880Z 2025-05-07T20:31:49.5017954Z @given( 2025-05-07T20:31:49.5018077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5018170Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5018285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5018473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5018583Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5018658Z ) 2025-05-07T20:31:49.5018897Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5018988Z def test_silu_mul_quant( 2025-05-07T20:31:49.5019061Z self, 2025-05-07T20:31:49.5019133Z T: int, 2025-05-07T20:31:49.5019204Z D: int, 2025-05-07T20:31:49.5019299Z scale_ub: Optional[float], 2025-05-07T20:31:49.5019384Z contiguous: bool, 2025-05-07T20:31:49.5019467Z compiled: bool, 2025-05-07T20:31:49.5019540Z ) -> None: 2025-05-07T20:31:49.5019632Z torch.manual_seed(2025) 2025-05-07T20:31:49.5019703Z 2025-05-07T20:31:49.5019869Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5019940Z 2025-05-07T20:31:49.5020030Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5020153Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5020239Z x = x_sign * x_clamp 2025-05-07T20:31:49.5020318Z x0 = x[:, :D] 2025-05-07T20:31:49.5020393Z x1 = x[:, D:] 2025-05-07T20:31:49.5020461Z 2025-05-07T20:31:49.5020541Z if contiguous: 2025-05-07T20:31:49.5020626Z x0 = x0.contiguous() 2025-05-07T20:31:49.5020710Z x1 = x1.contiguous() 2025-05-07T20:31:49.5020778Z 2025-05-07T20:31:49.5020866Z if scale_ub is not None: 2025-05-07T20:31:49.5020970Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5021102Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5021182Z ) 2025-05-07T20:31:49.5021260Z else: 2025-05-07T20:31:49.5021347Z scale_ub_tensor = None 2025-05-07T20:31:49.5021415Z 2025-05-07T20:31:49.5021542Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5021627Z op = silu_mul_quant 2025-05-07T20:31:49.5021712Z if compiled: 2025-05-07T20:31:49.5021813Z op = torch.compile(op) 2025-05-07T20:31:49.5021915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5021984Z 2025-05-07T20:31:49.5022074Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5022078Z 2025-05-07T20:31:49.5022170Z moe/activation_test.py:117: 2025-05-07T20:31:49.5022300Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5022396Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5022493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5022999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5023092Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5023476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5023725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5024062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5024154Z kernel = self.compile( 2025-05-07T20:31:49.5024529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5024701Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5024832Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5024836Z 2025-05-07T20:31:49.5025124Z self = 2025-05-07T20:31:49.5025900Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5026476Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02a40820>} 2025-05-07T20:31:49.5027216Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5027413Z context = 2025-05-07T20:31:49.5027417Z 2025-05-07T20:31:49.5027580Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5027848Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5027948Z module_map=module_map) 2025-05-07T20:31:49.5028105Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5028200Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5028280Z E ^ 2025-05-07T20:31:49.5028634Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5028639Z 2025-05-07T20:31:49.5029044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5029049Z 2025-05-07T20:31:49.5029148Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5029369Z self=, 2025-05-07T20:31:49.5029442Z T=4096, 2025-05-07T20:31:49.5029513Z D=7168, 2025-05-07T20:31:49.5029602Z scale_ub=1200.0, 2025-05-07T20:31:49.5029682Z contiguous=False, 2025-05-07T20:31:49.5029765Z compiled=False, 2025-05-07T20:31:49.5029892Z ) 2025-05-07T20:31:49.5030107Z self = 2025-05-07T20:31:49.5030281Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.5030291Z 2025-05-07T20:31:49.5030367Z @given( 2025-05-07T20:31:49.5030483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5030585Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5030697Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5030809Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5030925Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5030995Z ) 2025-05-07T20:31:49.5031236Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5031324Z def test_silu_mul_quant( 2025-05-07T20:31:49.5031400Z self, 2025-05-07T20:31:49.5031476Z T: int, 2025-05-07T20:31:49.5031547Z D: int, 2025-05-07T20:31:49.5031640Z scale_ub: Optional[float], 2025-05-07T20:31:49.5031729Z contiguous: bool, 2025-05-07T20:31:49.5031809Z compiled: bool, 2025-05-07T20:31:49.5031888Z ) -> None: 2025-05-07T20:31:49.5031981Z torch.manual_seed(2025) 2025-05-07T20:31:49.5032052Z 2025-05-07T20:31:49.5032216Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5032289Z 2025-05-07T20:31:49.5032374Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5032498Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5032581Z x = x_sign * x_clamp 2025-05-07T20:31:49.5032654Z x0 = x[:, :D] 2025-05-07T20:31:49.5032733Z x1 = x[:, D:] 2025-05-07T20:31:49.5032801Z 2025-05-07T20:31:49.5032877Z if contiguous: 2025-05-07T20:31:49.5033053Z x0 = x0.contiguous() 2025-05-07T20:31:49.5033141Z x1 = x1.contiguous() 2025-05-07T20:31:49.5033208Z 2025-05-07T20:31:49.5033300Z if scale_ub is not None: 2025-05-07T20:31:49.5033402Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5033531Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5033680Z ) 2025-05-07T20:31:49.5033751Z else: 2025-05-07T20:31:49.5033840Z scale_ub_tensor = None 2025-05-07T20:31:49.5033911Z 2025-05-07T20:31:49.5034036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5034124Z op = silu_mul_quant 2025-05-07T20:31:49.5034208Z if compiled: 2025-05-07T20:31:49.5034307Z op = torch.compile(op) 2025-05-07T20:31:49.5034412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5034480Z 2025-05-07T20:31:49.5034564Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5034568Z 2025-05-07T20:31:49.5034668Z moe/activation_test.py:117: 2025-05-07T20:31:49.5034792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5034888Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5034982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5035481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5035583Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5035937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5036154Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5036493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5036582Z kernel = self.compile( 2025-05-07T20:31:49.5036967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5037137Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5037258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5037262Z 2025-05-07T20:31:49.5037467Z self = 2025-05-07T20:31:49.5038244Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5038753Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02f8faf0>} 2025-05-07T20:31:49.5039499Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5039688Z context = 2025-05-07T20:31:49.5039693Z 2025-05-07T20:31:49.5039858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5040122Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5040225Z module_map=module_map) 2025-05-07T20:31:49.5040382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5040473Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5040546Z E ^ 2025-05-07T20:31:49.5040899Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5040903Z 2025-05-07T20:31:49.5041391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5041401Z 2025-05-07T20:31:49.5041499Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5041718Z self=, 2025-05-07T20:31:49.5041799Z T=16384, 2025-05-07T20:31:49.5041870Z D=7168, 2025-05-07T20:31:49.5042022Z scale_ub=None, 2025-05-07T20:31:49.5042107Z contiguous=True, 2025-05-07T20:31:49.5042185Z compiled=True, 2025-05-07T20:31:49.5042254Z ) 2025-05-07T20:31:49.5042474Z self = 2025-05-07T20:31:49.5042645Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.5042650Z 2025-05-07T20:31:49.5042727Z @given( 2025-05-07T20:31:49.5042842Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5042940Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5043055Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5043195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5043324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5043403Z ) 2025-05-07T20:31:49.5043644Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5043733Z def test_silu_mul_quant( 2025-05-07T20:31:49.5043813Z self, 2025-05-07T20:31:49.5043886Z T: int, 2025-05-07T20:31:49.5043959Z D: int, 2025-05-07T20:31:49.5044056Z scale_ub: Optional[float], 2025-05-07T20:31:49.5044139Z contiguous: bool, 2025-05-07T20:31:49.5044222Z compiled: bool, 2025-05-07T20:31:49.5044298Z ) -> None: 2025-05-07T20:31:49.5044387Z torch.manual_seed(2025) 2025-05-07T20:31:49.5044461Z 2025-05-07T20:31:49.5044625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5044693Z 2025-05-07T20:31:49.5044782Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5044905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5044989Z x = x_sign * x_clamp 2025-05-07T20:31:49.5045073Z x0 = x[:, :D] 2025-05-07T20:31:49.5045147Z x1 = x[:, D:] 2025-05-07T20:31:49.5045216Z 2025-05-07T20:31:49.5045299Z if contiguous: 2025-05-07T20:31:49.5045386Z x0 = x0.contiguous() 2025-05-07T20:31:49.5045478Z x1 = x1.contiguous() 2025-05-07T20:31:49.5045547Z 2025-05-07T20:31:49.5045634Z if scale_ub is not None: 2025-05-07T20:31:49.5045736Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5045867Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5045940Z ) 2025-05-07T20:31:49.5046016Z else: 2025-05-07T20:31:49.5046107Z scale_ub_tensor = None 2025-05-07T20:31:49.5046176Z 2025-05-07T20:31:49.5046304Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5046389Z op = silu_mul_quant 2025-05-07T20:31:49.5046472Z if compiled: 2025-05-07T20:31:49.5046575Z op = torch.compile(op) 2025-05-07T20:31:49.5046677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5046748Z 2025-05-07T20:31:49.5046831Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5046836Z 2025-05-07T20:31:49.5046931Z moe/activation_test.py:117: 2025-05-07T20:31:49.5047064Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5047160Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5047253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5047618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5047707Z return fn(*args, **kwargs) 
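The "Trying example: test_silu_mul_quant(...)" lines are Hypothesis's verbose reporting: @given draws each parameter from st.sampled_from, and @settings(verbosity=Verbosity.verbose) prints every drawn example, so one underlying failure is reported once per sampled combination of (T, D, scale_ub, contiguous, compiled). A self-contained sketch of the same mechanism (toy test, hypothetical names):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=8, deadline=None)
    def test_toy(T: int, compiled: bool) -> None:
        # Verbose mode logs "Trying example: test_toy(T=..., compiled=...)"
        # for every draw, which is the pattern filling this log.
        assert T >= 1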
2025-05-07T20:31:49.5048201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5048293Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5048753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5048977Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5049309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5049470Z kernel = self.compile( 2025-05-07T20:31:49.5049849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5050019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5050143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5050147Z 2025-05-07T20:31:49.5050351Z self = 2025-05-07T20:31:49.5051126Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5051627Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02e67790>} 2025-05-07T20:31:49.5052370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5052561Z context = 2025-05-07T20:31:49.5052565Z 2025-05-07T20:31:49.5052725Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5052982Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5053093Z module_map=module_map) 2025-05-07T20:31:49.5053253Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5053366Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5053447Z E ^ 2025-05-07T20:31:49.5053822Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5053831Z 2025-05-07T20:31:49.5054245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5054249Z 2025-05-07T20:31:49.5054348Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5054570Z self=, 2025-05-07T20:31:49.5054640Z T=4096, 2025-05-07T20:31:49.5054710Z D=5120, 2025-05-07T20:31:49.5054791Z scale_ub=None, 2025-05-07T20:31:49.5054872Z contiguous=False, 2025-05-07T20:31:49.5054950Z compiled=True, 2025-05-07T20:31:49.5055023Z ) 2025-05-07T20:31:49.5055239Z self = 2025-05-07T20:31:49.5055407Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.5055412Z 2025-05-07T20:31:49.5055487Z @given( 2025-05-07T20:31:49.5055602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5055706Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5055817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5055932Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5056044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5056114Z ) 2025-05-07T20:31:49.5056356Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5056446Z def test_silu_mul_quant( 2025-05-07T20:31:49.5056516Z self, 2025-05-07T20:31:49.5056586Z T: int, 2025-05-07T20:31:49.5056657Z D: int, 2025-05-07T20:31:49.5056839Z scale_ub: Optional[float], 2025-05-07T20:31:49.5056926Z contiguous: bool, 2025-05-07T20:31:49.5057007Z compiled: bool, 2025-05-07T20:31:49.5057079Z ) -> None: 2025-05-07T20:31:49.5057173Z torch.manual_seed(2025) 2025-05-07T20:31:49.5057238Z 2025-05-07T20:31:49.5057402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5057549Z 2025-05-07T20:31:49.5057636Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5057758Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5057844Z x = x_sign * x_clamp 2025-05-07T20:31:49.5057919Z x0 = x[:, :D] 2025-05-07T20:31:49.5057993Z x1 = x[:, D:] 2025-05-07T20:31:49.5058067Z 2025-05-07T20:31:49.5058145Z if contiguous: 2025-05-07T20:31:49.5058232Z x0 = x0.contiguous() 2025-05-07T20:31:49.5058320Z x1 = x1.contiguous() 2025-05-07T20:31:49.5058386Z 2025-05-07T20:31:49.5058482Z if scale_ub is not None: 2025-05-07T20:31:49.5058584Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5058714Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5058791Z ) 2025-05-07T20:31:49.5058863Z else: 2025-05-07T20:31:49.5058951Z scale_ub_tensor = None 2025-05-07T20:31:49.5059028Z 2025-05-07T20:31:49.5059152Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5059236Z op = silu_mul_quant 2025-05-07T20:31:49.5059321Z if compiled: 2025-05-07T20:31:49.5059415Z op = torch.compile(op) 2025-05-07T20:31:49.5059516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5059588Z 2025-05-07T20:31:49.5059673Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5059677Z 2025-05-07T20:31:49.5059773Z moe/activation_test.py:117: 2025-05-07T20:31:49.5059896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5059997Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5060097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5060457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5060548Z return fn(*args, **kwargs) 
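The deepest frames differ between the two failing paths: the ref_fn path goes through triton/runtime/autotuner.py, where each candidate config is benchmarked and therefore compiled, while the direct silu_mul_quant path reaches jit.py and compiler.py straight away; either way src.make_ir raises before any kernel runs. A minimal sketch of the decorator stack that produces the autotuner frames (toy kernel, hypothetical names):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK": 1024}, num_warps=4),
            triton.Config({"BLOCK": 2048}, num_warps=8),
        ],
        key=["N"],
    )
    @triton.jit
    def _toy_scale_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        # Each config triggers its own compile inside _bench(), so an
        # unsupported dtype fails during autotuning itself.
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x * 2.0, mask=mask)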
2025-05-07T20:31:49.5061044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5061138Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5061491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5061708Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5062044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5062135Z kernel = self.compile( 2025-05-07T20:31:49.5062517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5062687Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5062808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5062817Z 2025-05-07T20:31:49.5063020Z self = 2025-05-07T20:31:49.5063791Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5064296Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02b5d550>} 2025-05-07T20:31:49.5065116Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5065306Z context = 2025-05-07T20:31:49.5065311Z 2025-05-07T20:31:49.5065476Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5065807Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5065913Z module_map=module_map) 2025-05-07T20:31:49.5066073Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5066166Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5066245Z E ^ 2025-05-07T20:31:49.5066603Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5066608Z 2025-05-07T20:31:49.5067024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5067034Z 2025-05-07T20:31:49.5067133Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5067351Z self=, 2025-05-07T20:31:49.5067429Z T=4096, 2025-05-07T20:31:49.5067506Z D=5120, 2025-05-07T20:31:49.5067584Z scale_ub=1200.0, 2025-05-07T20:31:49.5067668Z contiguous=False, 2025-05-07T20:31:49.5067745Z compiled=False, 2025-05-07T20:31:49.5067813Z ) 2025-05-07T20:31:49.5068031Z self = 2025-05-07T20:31:49.5068201Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.5068205Z 2025-05-07T20:31:49.5068280Z @given( 2025-05-07T20:31:49.5068395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5068488Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5068607Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5068720Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5068828Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5068900Z ) 2025-05-07T20:31:49.5069141Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5069234Z def test_silu_mul_quant( 2025-05-07T20:31:49.5069306Z self, 2025-05-07T20:31:49.5069377Z T: int, 2025-05-07T20:31:49.5069448Z D: int, 2025-05-07T20:31:49.5069544Z scale_ub: Optional[float], 2025-05-07T20:31:49.5069629Z contiguous: bool, 2025-05-07T20:31:49.5069717Z compiled: bool, 2025-05-07T20:31:49.5069790Z ) -> None: 2025-05-07T20:31:49.5069942Z torch.manual_seed(2025) 2025-05-07T20:31:49.5070016Z 2025-05-07T20:31:49.5070182Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5070252Z 2025-05-07T20:31:49.5070347Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5070465Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5070552Z x = x_sign * x_clamp 2025-05-07T20:31:49.5070632Z x0 = x[:, :D] 2025-05-07T20:31:49.5070705Z x1 = x[:, D:] 2025-05-07T20:31:49.5070775Z 2025-05-07T20:31:49.5070854Z if contiguous: 2025-05-07T20:31:49.5070945Z x0 = x0.contiguous() 2025-05-07T20:31:49.5071032Z x1 = x1.contiguous() 2025-05-07T20:31:49.5071101Z 2025-05-07T20:31:49.5071187Z if scale_ub is not None: 2025-05-07T20:31:49.5071290Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5071419Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5071492Z ) 2025-05-07T20:31:49.5071564Z else: 2025-05-07T20:31:49.5071653Z scale_ub_tensor = None 2025-05-07T20:31:49.5071722Z 2025-05-07T20:31:49.5071849Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5072020Z op = silu_mul_quant 2025-05-07T20:31:49.5072102Z if compiled: 2025-05-07T20:31:49.5072199Z op = torch.compile(op) 2025-05-07T20:31:49.5072301Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5072372Z 2025-05-07T20:31:49.5072456Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5072555Z 2025-05-07T20:31:49.5072648Z moe/activation_test.py:117: 2025-05-07T20:31:49.5072773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5072871Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5072966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5073520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5073612Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5073966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5074189Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5074523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5074614Z kernel = self.compile( 2025-05-07T20:31:49.5074996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5075165Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5075288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5075292Z 2025-05-07T20:31:49.5075494Z self = 2025-05-07T20:31:49.5076269Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5076768Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02ebd0d0>} 2025-05-07T20:31:49.5077506Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5077699Z context = 2025-05-07T20:31:49.5077704Z 2025-05-07T20:31:49.5077864Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5078123Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5078223Z module_map=module_map) 2025-05-07T20:31:49.5078384Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5078486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5078558Z E ^ 2025-05-07T20:31:49.5078913Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5078918Z 2025-05-07T20:31:49.5079325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5079334Z 2025-05-07T20:31:49.5079430Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5079650Z self=, 2025-05-07T20:31:49.5079723Z T=4096, 2025-05-07T20:31:49.5079799Z D=5120, 2025-05-07T20:31:49.5079877Z scale_ub=1200.0, 2025-05-07T20:31:49.5079955Z contiguous=False, 2025-05-07T20:31:49.5080033Z compiled=True, 2025-05-07T20:31:49.5080102Z ) 2025-05-07T20:31:49.5080314Z self = 2025-05-07T20:31:49.5080638Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.5080643Z 2025-05-07T20:31:49.5080718Z @given( 2025-05-07T20:31:49.5080832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5080932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5081043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5081234Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5081343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5081413Z ) 2025-05-07T20:31:49.5081655Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5081746Z def test_silu_mul_quant( 2025-05-07T20:31:49.5081816Z self, 2025-05-07T20:31:49.5081891Z T: int, 2025-05-07T20:31:49.5081963Z D: int, 2025-05-07T20:31:49.5082054Z scale_ub: Optional[float], 2025-05-07T20:31:49.5082141Z contiguous: bool, 2025-05-07T20:31:49.5082226Z compiled: bool, 2025-05-07T20:31:49.5082298Z ) -> None: 2025-05-07T20:31:49.5082392Z torch.manual_seed(2025) 2025-05-07T20:31:49.5082460Z 2025-05-07T20:31:49.5082626Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5082700Z 2025-05-07T20:31:49.5082788Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5082914Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5082998Z x = x_sign * x_clamp 2025-05-07T20:31:49.5083073Z x0 = x[:, :D] 2025-05-07T20:31:49.5083152Z x1 = x[:, D:] 2025-05-07T20:31:49.5083221Z 2025-05-07T20:31:49.5083317Z if contiguous: 2025-05-07T20:31:49.5083413Z x0 = x0.contiguous() 2025-05-07T20:31:49.5083518Z x1 = x1.contiguous() 2025-05-07T20:31:49.5083587Z 2025-05-07T20:31:49.5083676Z if scale_ub is not None: 2025-05-07T20:31:49.5083777Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5083911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5083985Z ) 2025-05-07T20:31:49.5084054Z else: 2025-05-07T20:31:49.5084145Z scale_ub_tensor = None 2025-05-07T20:31:49.5084213Z 2025-05-07T20:31:49.5084338Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5084431Z op = silu_mul_quant 2025-05-07T20:31:49.5084510Z if compiled: 2025-05-07T20:31:49.5084604Z op = torch.compile(op) 2025-05-07T20:31:49.5084708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5084774Z 2025-05-07T20:31:49.5084862Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5084866Z 2025-05-07T20:31:49.5084960Z moe/activation_test.py:117: 2025-05-07T20:31:49.5085085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5085185Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5085280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5085645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5085735Z return fn(*args, **kwargs) 
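Note that compiled=True and compiled=False examples fail identically: torch.compile only wraps the call (the _dynamo/eval_frame.py frame above re-enters the original function), and the Triton kernel is still launched through the same jit/compile path, so the architecture check trips in both modes. A minimal sketch of that wrapping (toy op, not the FBGEMM one):

    import torch

    def silu(x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(x)

    compiled_silu = torch.compile(silu)

    x = torch.randn(8, device="cuda", dtype=torch.bfloat16)
    # Eager and compiled execution reach the same device kernels; a dtype
    # the GPU cannot lower fails in both modes, as this log shows.
    y_eager = silu(x)
    y_compiled = compiled_silu(x)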
2025-05-07T20:31:49.5086223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5086320Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5086674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5086892Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5087228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5087317Z kernel = self.compile( 2025-05-07T20:31:49.5087693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5087949Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5088073Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5088078Z 2025-05-07T20:31:49.5088278Z self = 2025-05-07T20:31:49.5089128Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5089627Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02ebddc0>} 2025-05-07T20:31:49.5090369Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5090564Z context = 2025-05-07T20:31:49.5090568Z 2025-05-07T20:31:49.5090730Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5090988Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5091096Z module_map=module_map) 2025-05-07T20:31:49.5091266Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5091358Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5091433Z E ^ 2025-05-07T20:31:49.5091787Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:49.5092310Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = 
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
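The failure is environmental rather than numerical: the Triton kernel requests the fp8e4nv dtype, and this runner's GPU does not provide it, so compilation of _fbgemm_silu_mul_quant aborts before any example can run. The g5 runner carries an A10G (sm_86, inferred from the instance type), and fp8e4nv appears to be Triton's NVIDIA e4m3 format, which is only enabled on newer compute capabilities; on this card Triton offers only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability guard, assuming one wanted these tests to skip rather than fail on such GPUs (the (8, 9) threshold, supports_fp8e4nv, and the skipif wiring are assumptions, not taken from this log or from the FBGEMM test suite):

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton enables fp8e4nv (e4m3) from compute
        # capability 8.9 (Ada) / 9.0 (Hopper) onward; the A10G on this
        # runner reports (8, 6) and therefore lacks it.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard for the failing test:
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="GPU lacks fp8e4nv; Triton supports only ('fp8e4b15', 'fp8e5') here",
    )

With a guard like this the suite would report one skip per example instead of repeating the identical CompilationError for every Hypothesis draw.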
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5108916Z 2025-05-07T20:31:49.5109336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5109341Z 2025-05-07T20:31:49.5109578Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5109798Z self=, 2025-05-07T20:31:49.5109931Z T=1, 2025-05-07T20:31:49.5110008Z D=7168, 2025-05-07T20:31:49.5110086Z scale_ub=None, 2025-05-07T20:31:49.5110165Z contiguous=True, 2025-05-07T20:31:49.5110248Z compiled=False, 2025-05-07T20:31:49.5110317Z ) 2025-05-07T20:31:49.5110530Z self = 2025-05-07T20:31:49.5110696Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.5110701Z 2025-05-07T20:31:49.5110776Z @given( 2025-05-07T20:31:49.5110905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5111001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5111112Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5111225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5111343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5111415Z ) 2025-05-07T20:31:49.5111661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5111748Z def test_silu_mul_quant( 2025-05-07T20:31:49.5111823Z self, 2025-05-07T20:31:49.5111893Z T: int, 2025-05-07T20:31:49.5111962Z D: int, 2025-05-07T20:31:49.5112057Z scale_ub: Optional[float], 2025-05-07T20:31:49.5112141Z contiguous: bool, 2025-05-07T20:31:49.5112222Z compiled: bool, 2025-05-07T20:31:49.5112301Z ) -> None: 2025-05-07T20:31:49.5112391Z torch.manual_seed(2025) 2025-05-07T20:31:49.5112465Z 2025-05-07T20:31:49.5112632Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5112703Z 2025-05-07T20:31:49.5112791Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5112914Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5112999Z x = x_sign * x_clamp 2025-05-07T20:31:49.5113080Z x0 = x[:, :D] 2025-05-07T20:31:49.5113162Z x1 = x[:, D:] 2025-05-07T20:31:49.5113229Z 2025-05-07T20:31:49.5113322Z if contiguous: 2025-05-07T20:31:49.5113423Z x0 = x0.contiguous() 2025-05-07T20:31:49.5113519Z x1 = x1.contiguous() 2025-05-07T20:31:49.5113604Z 2025-05-07T20:31:49.5113692Z if scale_ub is not None: 2025-05-07T20:31:49.5113796Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5113934Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5114007Z ) 2025-05-07T20:31:49.5114079Z else: 2025-05-07T20:31:49.5114180Z scale_ub_tensor = None 2025-05-07T20:31:49.5114248Z 2025-05-07T20:31:49.5114374Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5114464Z op = silu_mul_quant 2025-05-07T20:31:49.5114544Z if compiled: 2025-05-07T20:31:49.5114642Z op = torch.compile(op) 2025-05-07T20:31:49.5114749Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5114820Z 2025-05-07T20:31:49.5114908Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5114912Z 2025-05-07T20:31:49.5115009Z moe/activation_test.py:117: 2025-05-07T20:31:49.5115132Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5115232Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5115327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5115829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5116009Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5116367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5116592Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5117003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5117090Z kernel = self.compile( 2025-05-07T20:31:49.5117473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5117644Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5117769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5117774Z 2025-05-07T20:31:49.5117978Z self = 2025-05-07T20:31:49.5118762Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5119268Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa0285c160>} 2025-05-07T20:31:49.5120018Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5120211Z context = 2025-05-07T20:31:49.5120215Z 2025-05-07T20:31:49.5120379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5120648Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5120755Z module_map=module_map) 2025-05-07T20:31:49.5120916Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5121014Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5121086Z E ^ 2025-05-07T20:31:49.5121438Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5121447Z 2025-05-07T20:31:49.5121857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5121862Z 2025-05-07T20:31:49.5121960Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5122180Z self=, 2025-05-07T20:31:49.5122254Z T=16384, 2025-05-07T20:31:49.5122326Z D=7168, 2025-05-07T20:31:49.5122408Z scale_ub=1200.0, 2025-05-07T20:31:49.5122491Z contiguous=False, 2025-05-07T20:31:49.5122573Z compiled=True, 2025-05-07T20:31:49.5122645Z ) 2025-05-07T20:31:49.5122857Z self = 2025-05-07T20:31:49.5123028Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.5123033Z 2025-05-07T20:31:49.5123110Z @given( 2025-05-07T20:31:49.5123231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5123329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5123437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5123548Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5123663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5123734Z ) 2025-05-07T20:31:49.5123973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5124063Z def test_silu_mul_quant( 2025-05-07T20:31:49.5124135Z self, 2025-05-07T20:31:49.5124206Z T: int, 2025-05-07T20:31:49.5124365Z D: int, 2025-05-07T20:31:49.5124462Z scale_ub: Optional[float], 2025-05-07T20:31:49.5124545Z contiguous: bool, 2025-05-07T20:31:49.5124626Z compiled: bool, 2025-05-07T20:31:49.5124699Z ) -> None: 2025-05-07T20:31:49.5124790Z torch.manual_seed(2025) 2025-05-07T20:31:49.5124934Z 2025-05-07T20:31:49.5125100Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5125173Z 2025-05-07T20:31:49.5125259Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5125379Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5125466Z x = x_sign * x_clamp 2025-05-07T20:31:49.5125542Z x0 = x[:, :D] 2025-05-07T20:31:49.5125615Z x1 = x[:, D:] 2025-05-07T20:31:49.5125687Z 2025-05-07T20:31:49.5125766Z if contiguous: 2025-05-07T20:31:49.5125854Z x0 = x0.contiguous() 2025-05-07T20:31:49.5125938Z x1 = x1.contiguous() 2025-05-07T20:31:49.5126011Z 2025-05-07T20:31:49.5126099Z if scale_ub is not None: 2025-05-07T20:31:49.5126198Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5126329Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5126406Z ) 2025-05-07T20:31:49.5126480Z else: 2025-05-07T20:31:49.5126574Z scale_ub_tensor = None 2025-05-07T20:31:49.5126643Z 2025-05-07T20:31:49.5126768Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5126857Z op = silu_mul_quant 2025-05-07T20:31:49.5126938Z if compiled: 2025-05-07T20:31:49.5127033Z op = torch.compile(op) 2025-05-07T20:31:49.5127135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5127204Z 2025-05-07T20:31:49.5127287Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5127291Z 2025-05-07T20:31:49.5127384Z moe/activation_test.py:117: 2025-05-07T20:31:49.5127516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5127611Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5127711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5128072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5128169Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5128659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5128752Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5129106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5129326Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5129659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5129754Z kernel = self.compile( 2025-05-07T20:31:49.5130127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5130302Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5130422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5130432Z 2025-05-07T20:31:49.5130635Z self = 2025-05-07T20:31:49.5131414Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5131918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa0285c4c0>} 2025-05-07T20:31:49.5132744Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5132940Z context = 2025-05-07T20:31:49.5133018Z 2025-05-07T20:31:49.5133213Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5133498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5133602Z module_map=module_map) 2025-05-07T20:31:49.5133766Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5133859Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5133933Z E ^ 2025-05-07T20:31:49.5134292Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5134297Z 2025-05-07T20:31:49.5134709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5134714Z 2025-05-07T20:31:49.5134815Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5135033Z self=, 2025-05-07T20:31:49.5135110Z T=1, 2025-05-07T20:31:49.5135181Z D=7168, 2025-05-07T20:31:49.5135257Z scale_ub=None, 2025-05-07T20:31:49.5135338Z contiguous=False, 2025-05-07T20:31:49.5135421Z compiled=False, 2025-05-07T20:31:49.5135490Z ) 2025-05-07T20:31:49.5135707Z self = 2025-05-07T20:31:49.5135874Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.5135879Z 2025-05-07T20:31:49.5135952Z @given( 2025-05-07T20:31:49.5136071Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5136165Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5136279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5136395Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5136504Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5136575Z ) 2025-05-07T20:31:49.5136818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5136910Z def test_silu_mul_quant( 2025-05-07T20:31:49.5136982Z self, 2025-05-07T20:31:49.5137058Z T: int, 2025-05-07T20:31:49.5137128Z D: int, 2025-05-07T20:31:49.5137224Z scale_ub: Optional[float], 2025-05-07T20:31:49.5137306Z contiguous: bool, 2025-05-07T20:31:49.5137385Z compiled: bool, 2025-05-07T20:31:49.5137460Z ) -> None: 2025-05-07T20:31:49.5137548Z torch.manual_seed(2025) 2025-05-07T20:31:49.5137616Z 2025-05-07T20:31:49.5137783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5137853Z 2025-05-07T20:31:49.5137946Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5138069Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5138154Z x = x_sign * x_clamp 2025-05-07T20:31:49.5138226Z x0 = x[:, :D] 2025-05-07T20:31:49.5138305Z x1 = x[:, D:] 2025-05-07T20:31:49.5138377Z 2025-05-07T20:31:49.5138458Z if contiguous: 2025-05-07T20:31:49.5138545Z x0 = x0.contiguous() 2025-05-07T20:31:49.5138630Z x1 = x1.contiguous() 2025-05-07T20:31:49.5138701Z 2025-05-07T20:31:49.5138789Z if scale_ub is not None: 2025-05-07T20:31:49.5138891Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5139021Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5139094Z ) 2025-05-07T20:31:49.5139166Z else: 2025-05-07T20:31:49.5139260Z scale_ub_tensor = None 2025-05-07T20:31:49.5139327Z 2025-05-07T20:31:49.5139557Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5139647Z op = silu_mul_quant 2025-05-07T20:31:49.5139727Z if compiled: 2025-05-07T20:31:49.5139820Z op = torch.compile(op) 2025-05-07T20:31:49.5139925Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5139993Z 2025-05-07T20:31:49.5140157Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5140161Z 2025-05-07T20:31:49.5140251Z moe/activation_test.py:117: 2025-05-07T20:31:49.5140375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5140474Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5140569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5141073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5141170Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5141531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5141753Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5142086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5142172Z kernel = self.compile( 2025-05-07T20:31:49.5142555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5142726Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5142852Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5142857Z 2025-05-07T20:31:49.5143060Z self = 2025-05-07T20:31:49.5143893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5144399Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02d1c820>} 2025-05-07T20:31:49.5145138Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5145335Z context = 2025-05-07T20:31:49.5145340Z 2025-05-07T20:31:49.5145500Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5145757Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5145866Z module_map=module_map) 2025-05-07T20:31:49.5146028Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5146128Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5146201Z E ^ 2025-05-07T20:31:49.5146554Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5146558Z 2025-05-07T20:31:49.5146974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5146979Z 2025-05-07T20:31:49.5147076Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5147297Z self=, 2025-05-07T20:31:49.5147370Z T=2048, 2025-05-07T20:31:49.5147442Z D=7168, 2025-05-07T20:31:49.5147526Z scale_ub=None, 2025-05-07T20:31:49.5147605Z contiguous=False, 2025-05-07T20:31:49.5147680Z compiled=True, 2025-05-07T20:31:49.5147755Z ) 2025-05-07T20:31:49.5148056Z self = 2025-05-07T20:31:49.5148227Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.5148231Z 2025-05-07T20:31:49.5148305Z @given( 2025-05-07T20:31:49.5148419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5148513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5148699Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5148811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5148924Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5148994Z ) 2025-05-07T20:31:49.5149236Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5149327Z def test_silu_mul_quant( 2025-05-07T20:31:49.5149400Z self, 2025-05-07T20:31:49.5149469Z T: int, 2025-05-07T20:31:49.5149545Z D: int, 2025-05-07T20:31:49.5149639Z scale_ub: Optional[float], 2025-05-07T20:31:49.5149727Z contiguous: bool, 2025-05-07T20:31:49.5149811Z compiled: bool, 2025-05-07T20:31:49.5149945Z ) -> None: 2025-05-07T20:31:49.5150036Z torch.manual_seed(2025) 2025-05-07T20:31:49.5150109Z 2025-05-07T20:31:49.5150273Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5150356Z 2025-05-07T20:31:49.5150443Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5150561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5150650Z x = x_sign * x_clamp 2025-05-07T20:31:49.5150723Z x0 = x[:, :D] 2025-05-07T20:31:49.5150799Z x1 = x[:, D:] 2025-05-07T20:31:49.5150869Z 2025-05-07T20:31:49.5150947Z if contiguous: 2025-05-07T20:31:49.5151030Z x0 = x0.contiguous() 2025-05-07T20:31:49.5151118Z x1 = x1.contiguous() 2025-05-07T20:31:49.5151187Z 2025-05-07T20:31:49.5151273Z if scale_ub is not None: 2025-05-07T20:31:49.5151376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5151512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5151587Z ) 2025-05-07T20:31:49.5151661Z else: 2025-05-07T20:31:49.5151751Z scale_ub_tensor = None 2025-05-07T20:31:49.5151822Z 2025-05-07T20:31:49.5151946Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5152034Z op = silu_mul_quant 2025-05-07T20:31:49.5152117Z if compiled: 2025-05-07T20:31:49.5152211Z op = torch.compile(op) 2025-05-07T20:31:49.5152310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5152382Z 2025-05-07T20:31:49.5152467Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5152471Z 2025-05-07T20:31:49.5152566Z moe/activation_test.py:117: 2025-05-07T20:31:49.5152693Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5152789Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5152894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5153264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5153352Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5153848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5153945Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5154296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5154518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5154851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5154942Z kernel = self.compile( 2025-05-07T20:31:49.5155398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5155570Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5155695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5155700Z 2025-05-07T20:31:49.5155902Z self = 2025-05-07T20:31:49.5156754Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5157255Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02acf790>} 2025-05-07T20:31:49.5158010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5158197Z context = 2025-05-07T20:31:49.5158202Z 2025-05-07T20:31:49.5158361Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5158621Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5158726Z module_map=module_map) 2025-05-07T20:31:49.5158883Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5158978Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5159054Z E ^ 2025-05-07T20:31:49.5159408Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5159412Z 2025-05-07T20:31:49.5159821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5159830Z 2025-05-07T20:31:49.5159929Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5160149Z self=, 2025-05-07T20:31:49.5160222Z T=4096, 2025-05-07T20:31:49.5160296Z D=7168, 2025-05-07T20:31:49.5160379Z scale_ub=None, 2025-05-07T20:31:49.5160463Z contiguous=False, 2025-05-07T20:31:49.5160544Z compiled=True, 2025-05-07T20:31:49.5160615Z ) 2025-05-07T20:31:49.5160835Z self = 2025-05-07T20:31:49.5161004Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.5161009Z 2025-05-07T20:31:49.5161077Z @given( 2025-05-07T20:31:49.5161190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5161285Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5161395Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5161511Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5161623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5161694Z ) 2025-05-07T20:31:49.5161939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5162028Z def test_silu_mul_quant( 2025-05-07T20:31:49.5162102Z self, 2025-05-07T20:31:49.5162182Z T: int, 2025-05-07T20:31:49.5162252Z D: int, 2025-05-07T20:31:49.5162347Z scale_ub: Optional[float], 2025-05-07T20:31:49.5162434Z contiguous: bool, 2025-05-07T20:31:49.5162513Z compiled: bool, 2025-05-07T20:31:49.5162587Z ) -> None: 2025-05-07T20:31:49.5162682Z torch.manual_seed(2025) 2025-05-07T20:31:49.5162751Z 2025-05-07T20:31:49.5162918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5162988Z 2025-05-07T20:31:49.5163098Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5163233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5163410Z x = x_sign * x_clamp 2025-05-07T20:31:49.5163485Z x0 = x[:, :D] 2025-05-07T20:31:49.5163564Z x1 = x[:, D:] 2025-05-07T20:31:49.5163630Z 2025-05-07T20:31:49.5163707Z if contiguous: 2025-05-07T20:31:49.5163798Z x0 = x0.contiguous() 2025-05-07T20:31:49.5163955Z x1 = x1.contiguous() 2025-05-07T20:31:49.5164024Z 2025-05-07T20:31:49.5164114Z if scale_ub is not None: 2025-05-07T20:31:49.5164218Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5164350Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5164427Z ) 2025-05-07T20:31:49.5164500Z else: 2025-05-07T20:31:49.5164592Z scale_ub_tensor = None 2025-05-07T20:31:49.5164661Z 2025-05-07T20:31:49.5164785Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5164873Z op = silu_mul_quant 2025-05-07T20:31:49.5164953Z if compiled: 2025-05-07T20:31:49.5165051Z op = torch.compile(op) 2025-05-07T20:31:49.5165156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5165224Z 2025-05-07T20:31:49.5165311Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5165315Z 2025-05-07T20:31:49.5165411Z moe/activation_test.py:117: 2025-05-07T20:31:49.5165538Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5165636Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5165730Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5166095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5166185Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5166677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5166769Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5167128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5167347Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5167685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5167779Z kernel = self.compile( 2025-05-07T20:31:49.5168154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5168327Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5168447Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5168452Z 2025-05-07T20:31:49.5168654Z self = 2025-05-07T20:31:49.5169434Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5169935Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa028114c0>} 2025-05-07T20:31:49.5170684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5170871Z context = 2025-05-07T20:31:49.5170876Z 2025-05-07T20:31:49.5171040Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5171299Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5171400Z module_map=module_map) 2025-05-07T20:31:49.5171670Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5171767Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5171839Z E ^ 2025-05-07T20:31:49.5172193Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5172272Z 2025-05-07T20:31:49.5172681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5172686Z 2025-05-07T20:31:49.5172788Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5173004Z self=, 2025-05-07T20:31:49.5173078Z T=16384, 2025-05-07T20:31:49.5173157Z D=5120, 2025-05-07T20:31:49.5173235Z scale_ub=1200.0, 2025-05-07T20:31:49.5173334Z contiguous=False, 2025-05-07T20:31:49.5173423Z compiled=False, 2025-05-07T20:31:49.5173504Z ) 2025-05-07T20:31:49.5173738Z self = 2025-05-07T20:31:49.5173914Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.5173919Z 2025-05-07T20:31:49.5173991Z @given( 2025-05-07T20:31:49.5174106Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5174203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5174311Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5174426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5174536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5174606Z ) 2025-05-07T20:31:49.5174849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5174936Z def test_silu_mul_quant( 2025-05-07T20:31:49.5175012Z self, 2025-05-07T20:31:49.5175082Z T: int, 2025-05-07T20:31:49.5175152Z D: int, 2025-05-07T20:31:49.5175257Z scale_ub: Optional[float], 2025-05-07T20:31:49.5175342Z contiguous: bool, 2025-05-07T20:31:49.5175422Z compiled: bool, 2025-05-07T20:31:49.5175497Z ) -> None: 2025-05-07T20:31:49.5175586Z torch.manual_seed(2025) 2025-05-07T20:31:49.5175654Z 2025-05-07T20:31:49.5175821Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5175897Z 2025-05-07T20:31:49.5175985Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5176104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5176188Z x = x_sign * x_clamp 2025-05-07T20:31:49.5176262Z x0 = x[:, :D] 2025-05-07T20:31:49.5176338Z x1 = x[:, D:] 2025-05-07T20:31:49.5176406Z 2025-05-07T20:31:49.5176487Z if contiguous: 2025-05-07T20:31:49.5176572Z x0 = x0.contiguous() 2025-05-07T20:31:49.5176656Z x1 = x1.contiguous() 2025-05-07T20:31:49.5176727Z 2025-05-07T20:31:49.5176813Z if scale_ub is not None: 2025-05-07T20:31:49.5176918Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5177051Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5177124Z ) 2025-05-07T20:31:49.5177194Z else: 2025-05-07T20:31:49.5177286Z scale_ub_tensor = None 2025-05-07T20:31:49.5177361Z 2025-05-07T20:31:49.5177489Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5177575Z op = silu_mul_quant 2025-05-07T20:31:49.5177653Z if compiled: 2025-05-07T20:31:49.5177753Z op = torch.compile(op) 2025-05-07T20:31:49.5177853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5177922Z 2025-05-07T20:31:49.5178009Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5178013Z 2025-05-07T20:31:49.5178105Z moe/activation_test.py:117: 2025-05-07T20:31:49.5178227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5178326Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5178500Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5179003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:49.5179096Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5180142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5180364Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5180698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5180787Z kernel = self.compile( 2025-05-07T20:31:49.5181166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5181335Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5181466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5181471Z 2025-05-07T20:31:49.5181671Z self = 2025-05-07T20:31:49.5182445Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5182954Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02811820>} 2025-05-07T20:31:49.5183693Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5183883Z context = 2025-05-07T20:31:49.5183892Z 2025-05-07T20:31:49.5184053Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5184315Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5184420Z module_map=module_map) 2025-05-07T20:31:49.5184585Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5184679Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5184752Z E ^ 2025-05-07T20:31:49.5185108Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5185113Z 2025-05-07T20:31:49.5185526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5185530Z 2025-05-07T20:31:49.5185632Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5185854Z self=, 2025-05-07T20:31:49.5185927Z T=16384, 2025-05-07T20:31:49.5185999Z D=5120, 2025-05-07T20:31:49.5186078Z scale_ub=1200.0, 2025-05-07T20:31:49.5186156Z contiguous=True, 2025-05-07T20:31:49.5186238Z compiled=True, 2025-05-07T20:31:49.5186307Z ) 2025-05-07T20:31:49.5186525Z self = 2025-05-07T20:31:49.5186699Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.5186704Z 2025-05-07T20:31:49.5186778Z @given( 2025-05-07T20:31:49.5186896Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5186990Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5187101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5187214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5187324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5187396Z ) 2025-05-07T20:31:49.5187734Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5187827Z def test_silu_mul_quant( 2025-05-07T20:31:49.5187903Z self, 2025-05-07T20:31:49.5187972Z T: int, 2025-05-07T20:31:49.5188043Z D: int, 2025-05-07T20:31:49.5188141Z scale_ub: Optional[float], 2025-05-07T20:31:49.5188299Z contiguous: bool, 2025-05-07T20:31:49.5188381Z compiled: bool, 2025-05-07T20:31:49.5188458Z ) -> None: 2025-05-07T20:31:49.5188547Z torch.manual_seed(2025) 2025-05-07T20:31:49.5188616Z 2025-05-07T20:31:49.5188781Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5188850Z 2025-05-07T20:31:49.5188937Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5189058Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5189140Z x = x_sign * x_clamp 2025-05-07T20:31:49.5189220Z x0 = x[:, :D] 2025-05-07T20:31:49.5189300Z x1 = x[:, D:] 2025-05-07T20:31:49.5189367Z 2025-05-07T20:31:49.5189448Z if contiguous: 2025-05-07T20:31:49.5189535Z x0 = x0.contiguous() 2025-05-07T20:31:49.5189621Z x1 = x1.contiguous() 2025-05-07T20:31:49.5189693Z 2025-05-07T20:31:49.5189779Z if scale_ub is not None: 2025-05-07T20:31:49.5189958Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5190093Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5190166Z ) 2025-05-07T20:31:49.5190241Z else: 2025-05-07T20:31:49.5190336Z scale_ub_tensor = None 2025-05-07T20:31:49.5190402Z 2025-05-07T20:31:49.5190529Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5190619Z op = silu_mul_quant 2025-05-07T20:31:49.5190699Z if compiled: 2025-05-07T20:31:49.5190799Z op = torch.compile(op) 2025-05-07T20:31:49.5190900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5190970Z 2025-05-07T20:31:49.5191058Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5191063Z 2025-05-07T20:31:49.5191155Z moe/activation_test.py:117: 2025-05-07T20:31:49.5191279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5191378Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5191478Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5191841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5191930Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5192423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5192519Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5192873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5193122Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5193484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5193574Z kernel = self.compile( 2025-05-07T20:31:49.5193954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5194127Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5194248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5194253Z 2025-05-07T20:31:49.5194458Z self = 2025-05-07T20:31:49.5195234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5195824Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02a6be50>} 2025-05-07T20:31:49.5196575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5196863Z context = 2025-05-07T20:31:49.5196868Z 2025-05-07T20:31:49.5197032Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5197294Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5197401Z module_map=module_map) 2025-05-07T20:31:49.5197562Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5197667Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5197742Z E ^ 2025-05-07T20:31:49.5198093Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5198098Z 2025-05-07T20:31:49.5198510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5198519Z 2025-05-07T20:31:49.5198618Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5198836Z self=, 2025-05-07T20:31:49.5198913Z T=16384, 2025-05-07T20:31:49.5198983Z D=5120, 2025-05-07T20:31:49.5199059Z scale_ub=None, 2025-05-07T20:31:49.5199145Z contiguous=False, 2025-05-07T20:31:49.5199223Z compiled=True, 2025-05-07T20:31:49.5199294Z ) 2025-05-07T20:31:49.5199510Z self = 2025-05-07T20:31:49.5199687Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.5199692Z 2025-05-07T20:31:49.5199769Z @given( 2025-05-07T20:31:49.5199883Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5199977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5200092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5200208Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5200316Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5200390Z ) 2025-05-07T20:31:49.5200628Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5200716Z def test_silu_mul_quant( 2025-05-07T20:31:49.5200793Z self, 2025-05-07T20:31:49.5200863Z T: int, 2025-05-07T20:31:49.5200938Z D: int, 2025-05-07T20:31:49.5201031Z scale_ub: Optional[float], 2025-05-07T20:31:49.5201113Z contiguous: bool, 2025-05-07T20:31:49.5201194Z compiled: bool, 2025-05-07T20:31:49.5201270Z ) -> None: 2025-05-07T20:31:49.5201360Z torch.manual_seed(2025) 2025-05-07T20:31:49.5201433Z 2025-05-07T20:31:49.5201597Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5201666Z 2025-05-07T20:31:49.5201753Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5201876Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5201960Z x = x_sign * x_clamp 2025-05-07T20:31:49.5202039Z x0 = x[:, :D] 2025-05-07T20:31:49.5202114Z x1 = x[:, D:] 2025-05-07T20:31:49.5202183Z 2025-05-07T20:31:49.5202263Z if contiguous: 2025-05-07T20:31:49.5202350Z x0 = x0.contiguous() 2025-05-07T20:31:49.5202439Z x1 = x1.contiguous() 2025-05-07T20:31:49.5202507Z 2025-05-07T20:31:49.5202595Z if scale_ub is not None: 2025-05-07T20:31:49.5202698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5202826Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5202984Z ) 2025-05-07T20:31:49.5203067Z else: 2025-05-07T20:31:49.5203161Z scale_ub_tensor = None 2025-05-07T20:31:49.5203247Z 2025-05-07T20:31:49.5203390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5203486Z op = silu_mul_quant 2025-05-07T20:31:49.5203640Z if compiled: 2025-05-07T20:31:49.5203913Z op = torch.compile(op) 2025-05-07T20:31:49.5204018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5204091Z 2025-05-07T20:31:49.5204176Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5204180Z 2025-05-07T20:31:49.5204273Z moe/activation_test.py:117: 2025-05-07T20:31:49.5204402Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5204498Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5204593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5204967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5205055Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5205549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5205646Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5205996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5206221Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5206551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5206641Z kernel = self.compile( 2025-05-07T20:31:49.5207018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5207191Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5207317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5207321Z 2025-05-07T20:31:49.5207524Z self = 2025-05-07T20:31:49.5208298Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5208809Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02ae09d0>} 2025-05-07T20:31:49.5209550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5209747Z context = 2025-05-07T20:31:49.5209751Z 2025-05-07T20:31:49.5209915Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5210177Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5210286Z module_map=module_map) 2025-05-07T20:31:49.5210444Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5210543Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5210616Z E ^ 2025-05-07T20:31:49.5210967Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5210972Z 2025-05-07T20:31:49.5211384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5211388Z 2025-05-07T20:31:49.5211487Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5211919Z self=, 2025-05-07T20:31:49.5211996Z T=2048, 2025-05-07T20:31:49.5212069Z D=5120, 2025-05-07T20:31:49.5212149Z scale_ub=None, 2025-05-07T20:31:49.5212231Z contiguous=False, 2025-05-07T20:31:49.5212306Z compiled=True, 2025-05-07T20:31:49.5212500Z ) 2025-05-07T20:31:49.5212714Z self = 2025-05-07T20:31:49.5212880Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.5212884Z 2025-05-07T20:31:49.5212961Z @given( 2025-05-07T20:31:49.5213075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5213176Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5213285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5213397Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5213510Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5213586Z ) 2025-05-07T20:31:49.5213826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5213919Z def test_silu_mul_quant( 2025-05-07T20:31:49.5213991Z self, 2025-05-07T20:31:49.5214066Z T: int, 2025-05-07T20:31:49.5214144Z D: int, 2025-05-07T20:31:49.5214240Z scale_ub: Optional[float], 2025-05-07T20:31:49.5214327Z contiguous: bool, 2025-05-07T20:31:49.5214411Z compiled: bool, 2025-05-07T20:31:49.5214487Z ) -> None: 2025-05-07T20:31:49.5214578Z torch.manual_seed(2025) 2025-05-07T20:31:49.5214647Z 2025-05-07T20:31:49.5214810Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5214882Z 2025-05-07T20:31:49.5214970Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5215087Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5215177Z x = x_sign * x_clamp 2025-05-07T20:31:49.5215256Z x0 = x[:, :D] 2025-05-07T20:31:49.5215330Z x1 = x[:, D:] 2025-05-07T20:31:49.5215398Z 2025-05-07T20:31:49.5215476Z if contiguous: 2025-05-07T20:31:49.5215561Z x0 = x0.contiguous() 2025-05-07T20:31:49.5215651Z x1 = x1.contiguous() 2025-05-07T20:31:49.5215719Z 2025-05-07T20:31:49.5215816Z if scale_ub is not None: 2025-05-07T20:31:49.5215915Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5216046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5216123Z ) 2025-05-07T20:31:49.5216193Z else: 2025-05-07T20:31:49.5216284Z scale_ub_tensor = None 2025-05-07T20:31:49.5216356Z 2025-05-07T20:31:49.5216484Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5216567Z op = silu_mul_quant 2025-05-07T20:31:49.5216652Z if compiled: 2025-05-07T20:31:49.5216748Z op = torch.compile(op) 2025-05-07T20:31:49.5216853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5216924Z 2025-05-07T20:31:49.5217009Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5217013Z 2025-05-07T20:31:49.5217105Z moe/activation_test.py:117: 2025-05-07T20:31:49.5217227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5217326Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5217426Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5217790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5217876Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5218367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5218460Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5218898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5219120Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5219454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5219552Z kernel = self.compile( 2025-05-07T20:31:49.5220002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5220172Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5220297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5220302Z 2025-05-07T20:31:49.5220503Z self = 2025-05-07T20:31:49.5221288Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5221791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa0270a550>} 2025-05-07T20:31:49.5222537Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5222733Z context = 2025-05-07T20:31:49.5222738Z 2025-05-07T20:31:49.5222901Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5223207Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5223319Z module_map=module_map) 2025-05-07T20:31:49.5223481Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5223581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5223655Z E ^ 2025-05-07T20:31:49.5224009Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5224014Z 2025-05-07T20:31:49.5224421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5224431Z 2025-05-07T20:31:49.5224538Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5224755Z self=, 2025-05-07T20:31:49.5224827Z T=2048, 2025-05-07T20:31:49.5224900Z D=5120, 2025-05-07T20:31:49.5224976Z scale_ub=1200.0, 2025-05-07T20:31:49.5225060Z contiguous=False, 2025-05-07T20:31:49.5225139Z compiled=True, 2025-05-07T20:31:49.5225207Z ) 2025-05-07T20:31:49.5225424Z self = 2025-05-07T20:31:49.5225599Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.5225604Z 2025-05-07T20:31:49.5230187Z @given( 2025-05-07T20:31:49.5230322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5230421Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5230545Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5230655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5230764Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5230839Z ) 2025-05-07T20:31:49.5231086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5231178Z def test_silu_mul_quant( 2025-05-07T20:31:49.5231255Z self, 2025-05-07T20:31:49.5231326Z T: int, 2025-05-07T20:31:49.5231397Z D: int, 2025-05-07T20:31:49.5231493Z scale_ub: Optional[float], 2025-05-07T20:31:49.5231577Z contiguous: bool, 2025-05-07T20:31:49.5231765Z compiled: bool, 2025-05-07T20:31:49.5231843Z ) -> None: 2025-05-07T20:31:49.5231934Z torch.manual_seed(2025) 2025-05-07T20:31:49.5232009Z 2025-05-07T20:31:49.5232181Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5232252Z 2025-05-07T20:31:49.5232445Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5232567Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5232649Z x = x_sign * x_clamp 2025-05-07T20:31:49.5232725Z x0 = x[:, :D] 2025-05-07T20:31:49.5232798Z x1 = x[:, D:] 2025-05-07T20:31:49.5232867Z 2025-05-07T20:31:49.5232948Z if contiguous: 2025-05-07T20:31:49.5233035Z x0 = x0.contiguous() 2025-05-07T20:31:49.5233128Z x1 = x1.contiguous() 2025-05-07T20:31:49.5233195Z 2025-05-07T20:31:49.5233301Z if scale_ub is not None: 2025-05-07T20:31:49.5233412Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5233572Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5233645Z ) 2025-05-07T20:31:49.5233718Z else: 2025-05-07T20:31:49.5233810Z scale_ub_tensor = None 2025-05-07T20:31:49.5233878Z 2025-05-07T20:31:49.5234006Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5234096Z op = silu_mul_quant 2025-05-07T20:31:49.5234176Z if compiled: 2025-05-07T20:31:49.5234276Z op = torch.compile(op) 2025-05-07T20:31:49.5234377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5234449Z 2025-05-07T20:31:49.5234537Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5234542Z 2025-05-07T20:31:49.5234638Z moe/activation_test.py:117: 2025-05-07T20:31:49.5234772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5234869Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5234965Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5235346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5235435Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5235927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5236032Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5236385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5236607Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5236942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5237033Z kernel = self.compile( 2025-05-07T20:31:49.5237411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5237594Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5237720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5237725Z 2025-05-07T20:31:49.5237928Z self = 2025-05-07T20:31:49.5238711Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5239220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02547310>} 2025-05-07T20:31:49.5240042Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5240237Z context = 2025-05-07T20:31:49.5240242Z 2025-05-07T20:31:49.5240407Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5240665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5240847Z module_map=module_map) 2025-05-07T20:31:49.5241009Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5241105Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5241178Z E ^ 2025-05-07T20:31:49.5241529Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

The identical CompilationError (same test body, same traceback through triton/compiler/compiler.py:100) was then raised for each of the following examples:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
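Every CompilationError above is the same root failure: Triton only lowers the fp8e4nv (e4m3) dtype on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G on this linux.g5.4xlarge runner is compute capability 8.6, where only fp8e4b15 and fp8e5 are available. A minimal guard along these lines (a sketch; supports_fp8e4nv is a hypothetical helper, not part of the FBGEMM test suite) would skip the test cleanly on such runners instead of failing every example:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton emits fp8e4nv only for compute capability >= 8.9
        # (Ada/Hopper); the A10G in this log reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test above:
    # @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...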
Trying example: test_silu_mul_quant(
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
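The requested allocation sizes line up exactly with the test's tensor shapes: x has shape [T, 2*D] in bfloat16 (2 bytes per element), and torch.abs / torch.clamp each materialize a temporary of the same size. A quick check (a sketch, not from the test suite):

    # [T, 2*D] bfloat16 tensor, 2 bytes per element
    T, D = 16384, 5120
    print(T * 2 * D * 2 / 2**20)   # 320.0 -> the 320.00 MiB requested at activation_test.py:95 above
    T, D = 16384, 7168
    print(T * 2 * D * 2 / 2**20)   # 448.0 -> the 448.00 MiB requested at activation_test.py:92 below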
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

The same torch.OutOfMemoryError was then raised for the following examples (failing line and requested allocation shown after each):

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> activation_test.py:95, 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> activation_test.py:92, 448.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> activation_test.py:95, 56.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> activation_test.py:94, 56.00 MiB
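These out-of-memory failures accumulate across Hypothesis examples: each example allocates fresh tensors of up to several hundred MiB on the 22 GiB A10G while earlier examples' allocations are still cached by PyTorch. The error message's own suggestion (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) targets fragmentation; releasing cached blocks between examples is another common mitigation. A sketch under those assumptions, not the test suite's actual teardown:

    import gc
    import torch

    def free_cuda_between_examples() -> None:
        # Drop dangling Python references, then return cached blocks to the
        # driver so the next Hypothesis example starts from a cleaner
        # allocator state.
        gc.collect()
        torch.cuda.empty_cache()

    # Hypothetical usage: call from the TestCase's tearDown(), or wrap each
    # example body in try/finally with free_cuda_between_examples().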
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5375048Z 2025-05-07T20:31:49.5375160Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.5375165Z 2025-05-07T20:31:49.5375264Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5375487Z self=, 2025-05-07T20:31:49.5375565Z T=1, 2025-05-07T20:31:49.5375641Z D=7168, 2025-05-07T20:31:49.5375716Z scale_ub=1200.0, 2025-05-07T20:31:49.5375794Z contiguous=True, 2025-05-07T20:31:49.5375872Z compiled=False, 2025-05-07T20:31:49.5375941Z ) 2025-05-07T20:31:49.5376150Z self = 2025-05-07T20:31:49.5376313Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.5376317Z 2025-05-07T20:31:49.5376391Z @given( 2025-05-07T20:31:49.5376503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5376599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5376711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5376824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5376933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5377002Z ) 2025-05-07T20:31:49.5377246Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5377343Z def test_silu_mul_quant( 2025-05-07T20:31:49.5377416Z self, 2025-05-07T20:31:49.5377491Z T: int, 2025-05-07T20:31:49.5377560Z D: int, 2025-05-07T20:31:49.5377653Z scale_ub: Optional[float], 2025-05-07T20:31:49.5377739Z contiguous: bool, 2025-05-07T20:31:49.5377819Z compiled: bool, 2025-05-07T20:31:49.5377890Z ) -> None: 2025-05-07T20:31:49.5377982Z torch.manual_seed(2025) 2025-05-07T20:31:49.5378048Z 2025-05-07T20:31:49.5378212Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5378281Z 2025-05-07T20:31:49.5378449Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5378572Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5378656Z x = x_sign * x_clamp 2025-05-07T20:31:49.5378730Z x0 = x[:, :D] 2025-05-07T20:31:49.5378809Z x1 = x[:, D:] 2025-05-07T20:31:49.5378877Z 2025-05-07T20:31:49.5379031Z if contiguous: 2025-05-07T20:31:49.5379120Z x0 = x0.contiguous() 2025-05-07T20:31:49.5379204Z x1 = x1.contiguous() 2025-05-07T20:31:49.5379272Z 2025-05-07T20:31:49.5379362Z if scale_ub is not None: 2025-05-07T20:31:49.5379465Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5379600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5379673Z ) 2025-05-07T20:31:49.5379743Z else: 2025-05-07T20:31:49.5379834Z scale_ub_tensor = None 2025-05-07T20:31:49.5379899Z 2025-05-07T20:31:49.5380023Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5380121Z op = silu_mul_quant 2025-05-07T20:31:49.5380202Z if compiled: 2025-05-07T20:31:49.5380296Z op = torch.compile(op) 2025-05-07T20:31:49.5380400Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5380473Z 2025-05-07T20:31:49.5380559Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5380569Z 2025-05-07T20:31:49.5380665Z moe/activation_test.py:117: 2025-05-07T20:31:49.5380789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5380889Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5380983Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5381480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5381577Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5381937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5382156Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5382493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5382582Z kernel = self.compile( 2025-05-07T20:31:49.5382964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5383135Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5383258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5383262Z 2025-05-07T20:31:49.5383471Z self = 2025-05-07T20:31:49.5384301Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5384808Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa021dc550>} 2025-05-07T20:31:49.5385549Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5385743Z context = 2025-05-07T20:31:49.5385750Z 2025-05-07T20:31:49.5385912Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5386171Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5386277Z module_map=module_map) 2025-05-07T20:31:49.5386436Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5386636Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5386714Z E ^ 2025-05-07T20:31:49.5387065Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5387070Z 2025-05-07T20:31:49.5387480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5387564Z 2025-05-07T20:31:49.5387664Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5387886Z self=, 2025-05-07T20:31:49.5387961Z T=128, 2025-05-07T20:31:49.5388034Z D=5120, 2025-05-07T20:31:49.5388110Z scale_ub=None, 2025-05-07T20:31:49.5388192Z contiguous=True, 2025-05-07T20:31:49.5388274Z compiled=False, 2025-05-07T20:31:49.5388343Z ) 2025-05-07T20:31:49.5388560Z self = 2025-05-07T20:31:49.5388729Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.5388733Z 2025-05-07T20:31:49.5388806Z @given( 2025-05-07T20:31:49.5388922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5389015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5389133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5389244Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5389351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5389421Z ) 2025-05-07T20:31:49.5389659Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5389753Z def test_silu_mul_quant( 2025-05-07T20:31:49.5389883Z self, 2025-05-07T20:31:49.5389954Z T: int, 2025-05-07T20:31:49.5390029Z D: int, 2025-05-07T20:31:49.5390125Z scale_ub: Optional[float], 2025-05-07T20:31:49.5390211Z contiguous: bool, 2025-05-07T20:31:49.5390297Z compiled: bool, 2025-05-07T20:31:49.5390371Z ) -> None: 2025-05-07T20:31:49.5390459Z torch.manual_seed(2025) 2025-05-07T20:31:49.5390534Z 2025-05-07T20:31:49.5390698Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5390768Z 2025-05-07T20:31:49.5390863Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5390981Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5391063Z x = x_sign * x_clamp 2025-05-07T20:31:49.5391141Z x0 = x[:, :D] 2025-05-07T20:31:49.5391214Z x1 = x[:, D:] 2025-05-07T20:31:49.5391288Z 2025-05-07T20:31:49.5391367Z if contiguous: 2025-05-07T20:31:49.5391451Z x0 = x0.contiguous() 2025-05-07T20:31:49.5391538Z x1 = x1.contiguous() 2025-05-07T20:31:49.5391604Z 2025-05-07T20:31:49.5391691Z if scale_ub is not None: 2025-05-07T20:31:49.5391797Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5391933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5392005Z ) 2025-05-07T20:31:49.5392079Z else: 2025-05-07T20:31:49.5392168Z scale_ub_tensor = None 2025-05-07T20:31:49.5392238Z 2025-05-07T20:31:49.5392365Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5392455Z op = silu_mul_quant 2025-05-07T20:31:49.5392540Z if compiled: 2025-05-07T20:31:49.5392637Z op = torch.compile(op) 2025-05-07T20:31:49.5392738Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5392812Z 2025-05-07T20:31:49.5392898Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5392902Z 2025-05-07T20:31:49.5392993Z moe/activation_test.py:117: 2025-05-07T20:31:49.5393119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5393215Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5393309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5393892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5393990Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5394347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5394643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5394980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5395073Z kernel = self.compile( 2025-05-07T20:31:49.5395451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5395626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5395748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5395757Z 2025-05-07T20:31:49.5395958Z self = 2025-05-07T20:31:49.5396735Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5397246Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02227040>} 2025-05-07T20:31:49.5397992Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5398182Z context = 2025-05-07T20:31:49.5398187Z 2025-05-07T20:31:49.5398352Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5398614Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5398719Z module_map=module_map) 2025-05-07T20:31:49.5398880Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5398980Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5399055Z E ^ 2025-05-07T20:31:49.5399408Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5399413Z 2025-05-07T20:31:49.5399821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5399826Z 2025-05-07T20:31:49.5399928Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5400145Z self=, 2025-05-07T20:31:49.5400219Z T=128, 2025-05-07T20:31:49.5400299Z D=7168, 2025-05-07T20:31:49.5400375Z scale_ub=None, 2025-05-07T20:31:49.5400453Z contiguous=True, 2025-05-07T20:31:49.5400533Z compiled=False, 2025-05-07T20:31:49.5400600Z ) 2025-05-07T20:31:49.5400812Z self = 2025-05-07T20:31:49.5400982Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.5400987Z 2025-05-07T20:31:49.5401059Z @given( 2025-05-07T20:31:49.5401172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5401271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5401383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5401496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5401605Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5401675Z ) 2025-05-07T20:31:49.5401915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5402085Z def test_silu_mul_quant( 2025-05-07T20:31:49.5402161Z self, 2025-05-07T20:31:49.5402236Z T: int, 2025-05-07T20:31:49.5402308Z D: int, 2025-05-07T20:31:49.5402401Z scale_ub: Optional[float], 2025-05-07T20:31:49.5402486Z contiguous: bool, 2025-05-07T20:31:49.5402643Z compiled: bool, 2025-05-07T20:31:49.5402721Z ) -> None: 2025-05-07T20:31:49.5402812Z torch.manual_seed(2025) 2025-05-07T20:31:49.5402880Z 2025-05-07T20:31:49.5403065Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5403146Z 2025-05-07T20:31:49.5403247Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5403377Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5403460Z x = x_sign * x_clamp 2025-05-07T20:31:49.5403533Z x0 = x[:, :D] 2025-05-07T20:31:49.5403610Z x1 = x[:, D:] 2025-05-07T20:31:49.5403679Z 2025-05-07T20:31:49.5403989Z if contiguous: 2025-05-07T20:31:49.5404081Z x0 = x0.contiguous() 2025-05-07T20:31:49.5404165Z x1 = x1.contiguous() 2025-05-07T20:31:49.5404235Z 2025-05-07T20:31:49.5404322Z if scale_ub is not None: 2025-05-07T20:31:49.5404425Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5404566Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5404638Z ) 2025-05-07T20:31:49.5404706Z else: 2025-05-07T20:31:49.5404797Z scale_ub_tensor = None 2025-05-07T20:31:49.5404866Z 2025-05-07T20:31:49.5404991Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5405081Z op = silu_mul_quant 2025-05-07T20:31:49.5405160Z if compiled: 2025-05-07T20:31:49.5405255Z op = torch.compile(op) 2025-05-07T20:31:49.5405359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5405430Z 2025-05-07T20:31:49.5405514Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5405525Z 2025-05-07T20:31:49.5405618Z moe/activation_test.py:117: 2025-05-07T20:31:49.5405741Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5405843Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5405937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5406435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5406533Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5406886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5407108Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5407442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5407529Z kernel = self.compile( 2025-05-07T20:31:49.5407913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5408082Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5408204Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5408216Z 2025-05-07T20:31:49.5408418Z self = 2025-05-07T20:31:49.5409191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5409697Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02227c10>} 2025-05-07T20:31:49.5410572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5410767Z context = 2025-05-07T20:31:49.5410772Z 2025-05-07T20:31:49.5410934Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5411304Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5411411Z module_map=module_map) 2025-05-07T20:31:49.5411574Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5411670Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5411747Z E ^ 2025-05-07T20:31:49.5412097Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5412102Z 2025-05-07T20:31:49.5412520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5412524Z 2025-05-07T20:31:49.5412625Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5412841Z self=, 2025-05-07T20:31:49.5412919Z T=2048, 2025-05-07T20:31:49.5412994Z D=7168, 2025-05-07T20:31:49.5413074Z scale_ub=1200.0, 2025-05-07T20:31:49.5413156Z contiguous=True, 2025-05-07T20:31:49.5413233Z compiled=False, 2025-05-07T20:31:49.5413307Z ) 2025-05-07T20:31:49.5413555Z self = 2025-05-07T20:31:49.5413736Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.5413741Z 2025-05-07T20:31:49.5413818Z @given( 2025-05-07T20:31:49.5413935Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5414028Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5414149Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5414261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5414370Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5414440Z ) 2025-05-07T20:31:49.5414679Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5414772Z def test_silu_mul_quant( 2025-05-07T20:31:49.5414844Z self, 2025-05-07T20:31:49.5414913Z T: int, 2025-05-07T20:31:49.5414985Z D: int, 2025-05-07T20:31:49.5415079Z scale_ub: Optional[float], 2025-05-07T20:31:49.5415163Z contiguous: bool, 2025-05-07T20:31:49.5415244Z compiled: bool, 2025-05-07T20:31:49.5415318Z ) -> None: 2025-05-07T20:31:49.5415407Z torch.manual_seed(2025) 2025-05-07T20:31:49.5415476Z 2025-05-07T20:31:49.5415640Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5417439Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
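Annotation: the CompilationError above is a different failure mode from the OOMs: Triton rejects the kernel's fp8e4nv (e4m3) element type on this GPU architecture before any memory is touched. A hedged sketch of gating such tests on device capability follows; treating compute capability (8, 9) as the fp8e4nv floor is an assumption inferred from the error message, and the class name is illustrative.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv needs SM 8.9+ (Ada/Hopper-class parts);
    # older architectures only offer fp8e4b15/fp8e5, matching the
    # ValueError in the trace above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class SiluMulQuantFp8Test(unittest.TestCase):  # illustrative name
    pass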
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5417449Z 2025-05-07T20:31:49.5417563Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5417567Z 2025-05-07T20:31:49.5417668Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5417890Z self=, 2025-05-07T20:31:49.5417964Z T=1, 2025-05-07T20:31:49.5418039Z D=5120, 2025-05-07T20:31:49.5418116Z scale_ub=1200.0, 2025-05-07T20:31:49.5418192Z contiguous=True, 2025-05-07T20:31:49.5418381Z compiled=False, 2025-05-07T20:31:49.5418452Z ) 2025-05-07T20:31:49.5418669Z self = 2025-05-07T20:31:49.5418830Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.5418834Z 2025-05-07T20:31:49.5418983Z @given( 2025-05-07T20:31:49.5419099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5419194Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5419303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5419417Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5419527Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5419595Z ) 2025-05-07T20:31:49.5419837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5419927Z def test_silu_mul_quant( 2025-05-07T20:31:49.5420000Z self, 2025-05-07T20:31:49.5420082Z T: int, 2025-05-07T20:31:49.5420155Z D: int, 2025-05-07T20:31:49.5420251Z scale_ub: Optional[float], 2025-05-07T20:31:49.5420339Z contiguous: bool, 2025-05-07T20:31:49.5420418Z compiled: bool, 2025-05-07T20:31:49.5420492Z ) -> None: 2025-05-07T20:31:49.5420579Z torch.manual_seed(2025) 2025-05-07T20:31:49.5420654Z 2025-05-07T20:31:49.5420818Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5420889Z 2025-05-07T20:31:49.5420976Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5421097Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5421180Z x = x_sign * x_clamp 2025-05-07T20:31:49.5421256Z x0 = x[:, :D] 2025-05-07T20:31:49.5421336Z x1 = x[:, D:] 2025-05-07T20:31:49.5421404Z 2025-05-07T20:31:49.5421481Z if contiguous: 2025-05-07T20:31:49.5421572Z x0 = x0.contiguous() 2025-05-07T20:31:49.5421656Z x1 = x1.contiguous() 2025-05-07T20:31:49.5421733Z 2025-05-07T20:31:49.5421819Z if scale_ub is not None: 2025-05-07T20:31:49.5421920Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5422054Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5422126Z ) 2025-05-07T20:31:49.5422202Z else: 2025-05-07T20:31:49.5422295Z scale_ub_tensor = None 2025-05-07T20:31:49.5422362Z 2025-05-07T20:31:49.5422487Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5422575Z op = silu_mul_quant 2025-05-07T20:31:49.5422654Z if compiled: 2025-05-07T20:31:49.5422751Z op = torch.compile(op) 2025-05-07T20:31:49.5422855Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5422925Z 2025-05-07T20:31:49.5423010Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5423019Z 2025-05-07T20:31:49.5423109Z moe/activation_test.py:117: 2025-05-07T20:31:49.5423236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5423336Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5423430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5423924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5424024Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5424378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5424598Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5424934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5425024Z kernel = self.compile( 2025-05-07T20:31:49.5425402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5425655Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5425778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5425783Z 2025-05-07T20:31:49.5425989Z self = 2025-05-07T20:31:49.5426842Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5427344Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa021ad9d0>} 2025-05-07T20:31:49.5428095Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5428289Z context = 2025-05-07T20:31:49.5428294Z 2025-05-07T20:31:49.5428456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5428723Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5428836Z module_map=module_map) 2025-05-07T20:31:49.5429000Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5429093Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5429165Z E ^ 2025-05-07T20:31:49.5429521Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5429526Z 2025-05-07T20:31:49.5430007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5430013Z 2025-05-07T20:31:49.5430116Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5430337Z self=, 2025-05-07T20:31:49.5430410Z T=2048, 2025-05-07T20:31:49.5430487Z D=5120, 2025-05-07T20:31:49.5430563Z scale_ub=None, 2025-05-07T20:31:49.5430641Z contiguous=True, 2025-05-07T20:31:49.5430726Z compiled=False, 2025-05-07T20:31:49.5430794Z ) 2025-05-07T20:31:49.5431009Z self = 2025-05-07T20:31:49.5431182Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.5431186Z 2025-05-07T20:31:49.5431262Z @given( 2025-05-07T20:31:49.5431375Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5431472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5431584Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5431698Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5431813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5431882Z ) 2025-05-07T20:31:49.5432127Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5432216Z def test_silu_mul_quant( 2025-05-07T20:31:49.5432289Z self, 2025-05-07T20:31:49.5432368Z T: int, 2025-05-07T20:31:49.5432440Z D: int, 2025-05-07T20:31:49.5432533Z scale_ub: Optional[float], 2025-05-07T20:31:49.5432622Z contiguous: bool, 2025-05-07T20:31:49.5432703Z compiled: bool, 2025-05-07T20:31:49.5432774Z ) -> None: 2025-05-07T20:31:49.5432867Z torch.manual_seed(2025) 2025-05-07T20:31:49.5432938Z 2025-05-07T20:31:49.5433131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5433217Z 2025-05-07T20:31:49.5433307Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.5435188Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
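Annotation: the requested sizes in these OutOfMemoryError messages line up exactly with the test's own shapes: every intermediate (x, x_sign, x_clamp, ...) is a [T, 2*D] bfloat16 tensor at 2 bytes per element. A quick check against the figures in this log:

def bf16_mib(T: int, D: int) -> float:
    # Size of one [T, 2*D] bfloat16 tensor in MiB (2 bytes per element).
    return T * (2 * D) * 2 / 2**20

assert bf16_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB" above
assert bf16_mib(2048, 5120) == 40.0    # the 40.00 MiB examples
assert bf16_mib(4096, 7168) == 112.0   # the 112.00 MiB examples below
assert bf16_mib(16384, 7168) == 448.0  # the 448.00 MiB examples below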
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5435264Z 2025-05-07T20:31:49.5435379Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.5435384Z 2025-05-07T20:31:49.5435484Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5435705Z self=, 2025-05-07T20:31:49.5435779Z T=16384, 2025-05-07T20:31:49.5435857Z D=5120, 2025-05-07T20:31:49.5435934Z scale_ub=None, 2025-05-07T20:31:49.5436013Z contiguous=True, 2025-05-07T20:31:49.5436095Z compiled=False, 2025-05-07T20:31:49.5436171Z ) 2025-05-07T20:31:49.5436385Z self = 2025-05-07T20:31:49.5436558Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.5436562Z 2025-05-07T20:31:49.5436636Z @given( 2025-05-07T20:31:49.5436757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5436851Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5436962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5437075Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5437184Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5437254Z ) 2025-05-07T20:31:49.5437495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5437584Z def test_silu_mul_quant( 2025-05-07T20:31:49.5437660Z self, 2025-05-07T20:31:49.5437733Z T: int, 2025-05-07T20:31:49.5437806Z D: int, 2025-05-07T20:31:49.5437902Z scale_ub: Optional[float], 2025-05-07T20:31:49.5437986Z contiguous: bool, 2025-05-07T20:31:49.5438066Z compiled: bool, 2025-05-07T20:31:49.5438141Z ) -> None: 2025-05-07T20:31:49.5438230Z torch.manual_seed(2025) 2025-05-07T20:31:49.5438298Z 2025-05-07T20:31:49.5438470Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5440279Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5440285Z 2025-05-07T20:31:49.5440400Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5440404Z 2025-05-07T20:31:49.5440501Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5440725Z self=, 2025-05-07T20:31:49.5440803Z T=4096, 2025-05-07T20:31:49.5440873Z D=5120, 2025-05-07T20:31:49.5440951Z scale_ub=None, 2025-05-07T20:31:49.5441030Z contiguous=True, 2025-05-07T20:31:49.5441109Z compiled=False, 2025-05-07T20:31:49.5441182Z ) 2025-05-07T20:31:49.5441397Z self = 2025-05-07T20:31:49.5441561Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.5441565Z 2025-05-07T20:31:49.5441639Z @given( 2025-05-07T20:31:49.5441750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5441846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5442036Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5442148Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5442257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5442327Z ) 2025-05-07T20:31:49.5442566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5442758Z def test_silu_mul_quant( 2025-05-07T20:31:49.5442830Z self, 2025-05-07T20:31:49.5442903Z T: int, 2025-05-07T20:31:49.5442978Z D: int, 2025-05-07T20:31:49.5443072Z scale_ub: Optional[float], 2025-05-07T20:31:49.5443154Z contiguous: bool, 2025-05-07T20:31:49.5443254Z compiled: bool, 2025-05-07T20:31:49.5443337Z ) -> None: 2025-05-07T20:31:49.5443447Z torch.manual_seed(2025) 2025-05-07T20:31:49.5443520Z 2025-05-07T20:31:49.5443682Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5445471Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5445482Z 2025-05-07T20:31:49.5445592Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5445596Z 2025-05-07T20:31:49.5445695Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5445916Z self=, 2025-05-07T20:31:49.5445990Z T=2048, 2025-05-07T20:31:49.5446061Z D=5120, 2025-05-07T20:31:49.5446136Z scale_ub=None, 2025-05-07T20:31:49.5446222Z contiguous=False, 2025-05-07T20:31:49.5446305Z compiled=False, 2025-05-07T20:31:49.5446375Z ) 2025-05-07T20:31:49.5446594Z self = 2025-05-07T20:31:49.5446760Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.5446769Z 2025-05-07T20:31:49.5446842Z @given( 2025-05-07T20:31:49.5446955Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5447047Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5447156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5447271Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5447379Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5447446Z ) 2025-05-07T20:31:49.5447687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5447776Z def test_silu_mul_quant( 2025-05-07T20:31:49.5447852Z self, 2025-05-07T20:31:49.5447932Z T: int, 2025-05-07T20:31:49.5448004Z D: int, 2025-05-07T20:31:49.5448101Z scale_ub: Optional[float], 2025-05-07T20:31:49.5448185Z contiguous: bool, 2025-05-07T20:31:49.5448263Z compiled: bool, 2025-05-07T20:31:49.5448336Z ) -> None: 2025-05-07T20:31:49.5448429Z torch.manual_seed(2025) 2025-05-07T20:31:49.5448497Z 2025-05-07T20:31:49.5448660Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5450501Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
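Annotation: one detail of the test worth noting while reading these traces: x0 = x[:, :D] and x1 = x[:, D:] are strided views into the [T, 2*D] buffer, so the contiguous=True examples pay for two extra materialized copies on an already-full device. A small self-contained demonstration:

import torch

x = torch.randn(4, 8)        # stand-in for the [T, 2*D] activation buffer
x0, x1 = x[:, :4], x[:, 4:]
print(x0.is_contiguous())    # False: the view keeps the parent's row stride 8
x0c = x0.contiguous()        # materializes a copy (an extra T*D*2 bytes in bf16)
print(x0c.is_contiguous())   # True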
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True): same OutOfMemoryError at moe/activation_test.py:92 (the initial torch.randn allocation); tried to allocate 112.00 MiB with 30.44 MiB free. [@given block and test body identical to the examples above]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): same OutOfMemoryError at moe/activation_test.py:92; tried to allocate 40.00 MiB with 30.44 MiB free.

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5465639Z 2025-05-07T20:31:49.5465751Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5465755Z 2025-05-07T20:31:49.5465855Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5466074Z self=, 2025-05-07T20:31:49.5466230Z T=16384, 2025-05-07T20:31:49.5466304Z D=7168, 2025-05-07T20:31:49.5466379Z scale_ub=None, 2025-05-07T20:31:49.5466457Z contiguous=False, 2025-05-07T20:31:49.5466537Z compiled=True, 2025-05-07T20:31:49.5466605Z ) 2025-05-07T20:31:49.5466822Z self = 2025-05-07T20:31:49.5467139Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.5467144Z 2025-05-07T20:31:49.5467217Z @given( 2025-05-07T20:31:49.5467336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5467428Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5467534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5467647Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5467754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5467825Z ) 2025-05-07T20:31:49.5468071Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5468159Z def test_silu_mul_quant( 2025-05-07T20:31:49.5468236Z self, 2025-05-07T20:31:49.5468307Z T: int, 2025-05-07T20:31:49.5468379Z D: int, 2025-05-07T20:31:49.5468471Z scale_ub: Optional[float], 2025-05-07T20:31:49.5468560Z contiguous: bool, 2025-05-07T20:31:49.5468643Z compiled: bool, 2025-05-07T20:31:49.5468716Z ) -> None: 2025-05-07T20:31:49.5468802Z torch.manual_seed(2025) 2025-05-07T20:31:49.5468874Z 2025-05-07T20:31:49.5469033Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5475460Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
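Annotation: when triaging a run like this, replaying one specific Hypothesis example is usually faster than re-running the whole search. A sketch using Hypothesis's @example decorator to pin the case that just failed above (T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True); the strategies are copied from the log, but the test body here is a placeholder, not the real test:

from hypothesis import example, given, settings
from hypothesis import strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
@settings(deadline=None, print_blob=True)  # print_blob emits a replay token on failure
def test_silu_mul_quant_replay(T, D, scale_ub, contiguous, compiled) -> None:
    assert T > 0 and D > 0  # placeholder body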
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False): same OutOfMemoryError at moe/activation_test.py:92; tried to allocate 112.00 MiB with 30.44 MiB free. [@given block and test body identical to the examples above]

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False): same OutOfMemoryError at moe/activation_test.py:92; tried to allocate 448.00 MiB with 30.44 MiB free.

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5490781Z 2025-05-07T20:31:49.5490897Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5490902Z 2025-05-07T20:31:49.5491003Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5491223Z self=, 2025-05-07T20:31:49.5491300Z T=128, 2025-05-07T20:31:49.5491372Z D=5120, 2025-05-07T20:31:49.5491453Z scale_ub=1200.0, 2025-05-07T20:31:49.5491540Z contiguous=False, 2025-05-07T20:31:49.5491617Z compiled=False, 2025-05-07T20:31:49.5491688Z ) 2025-05-07T20:31:49.5491901Z self = 2025-05-07T20:31:49.5492067Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.5492072Z 2025-05-07T20:31:49.5492143Z @given( 2025-05-07T20:31:49.5492255Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5492347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5492457Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5492565Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5492679Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5492749Z ) 2025-05-07T20:31:49.5492990Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5493080Z def test_silu_mul_quant( 2025-05-07T20:31:49.5493168Z self, 2025-05-07T20:31:49.5493254Z T: int, 2025-05-07T20:31:49.5493337Z D: int, 2025-05-07T20:31:49.5493443Z scale_ub: Optional[float], 2025-05-07T20:31:49.5493525Z contiguous: bool, 2025-05-07T20:31:49.5493608Z compiled: bool, 2025-05-07T20:31:49.5493680Z ) -> None: 2025-05-07T20:31:49.5493774Z torch.manual_seed(2025) 2025-05-07T20:31:49.5493842Z 2025-05-07T20:31:49.5494003Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5494076Z 2025-05-07T20:31:49.5494163Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5494284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5494453Z x = x_sign * x_clamp 2025-05-07T20:31:49.5494531Z x0 = x[:, :D] 2025-05-07T20:31:49.5494607Z x1 = x[:, D:] 2025-05-07T20:31:49.5494677Z 2025-05-07T20:31:49.5494756Z if contiguous: 2025-05-07T20:31:49.5494848Z x0 = x0.contiguous() 2025-05-07T20:31:49.5494938Z x1 = x1.contiguous() 2025-05-07T20:31:49.5495081Z 2025-05-07T20:31:49.5495168Z if scale_ub is not None: 2025-05-07T20:31:49.5495269Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5495400Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5495477Z ) 2025-05-07T20:31:49.5495550Z else: 2025-05-07T20:31:49.5495640Z scale_ub_tensor = None 2025-05-07T20:31:49.5495713Z 2025-05-07T20:31:49.5495837Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5495921Z op = silu_mul_quant 2025-05-07T20:31:49.5496004Z if compiled: 2025-05-07T20:31:49.5496105Z op = torch.compile(op) 2025-05-07T20:31:49.5496209Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5496279Z 2025-05-07T20:31:49.5496366Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5496370Z 2025-05-07T20:31:49.5496463Z moe/activation_test.py:117: 2025-05-07T20:31:49.5496596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5496692Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5496787Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5497285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5497378Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5497736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5497954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5498296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5498387Z kernel = self.compile( 2025-05-07T20:31:49.5498762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5498941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5499065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5499069Z 2025-05-07T20:31:49.5499275Z self = 2025-05-07T20:31:49.5500049Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5500560Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa5b1e7d670>} 2025-05-07T20:31:49.5501302Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5501494Z context = 2025-05-07T20:31:49.5501499Z 2025-05-07T20:31:49.5501664Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5501923Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5502026Z module_map=module_map) 2025-05-07T20:31:49.5502186Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5502280Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5502357Z E ^ 2025-05-07T20:31:49.5502791Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5502796Z 2025-05-07T20:31:49.5503206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5503328Z 2025-05-07T20:31:49.5503433Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5503681Z self=, 2025-05-07T20:31:49.5503957Z T=2048, 2025-05-07T20:31:49.5504030Z D=7168, 2025-05-07T20:31:49.5504109Z scale_ub=None, 2025-05-07T20:31:49.5504195Z contiguous=False, 2025-05-07T20:31:49.5504273Z compiled=False, 2025-05-07T20:31:49.5504341Z ) 2025-05-07T20:31:49.5504555Z self = 2025-05-07T20:31:49.5504721Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.5504726Z 2025-05-07T20:31:49.5504804Z @given( 2025-05-07T20:31:49.5504923Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5505018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5505133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5505248Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5505363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5505435Z ) 2025-05-07T20:31:49.5505674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5505765Z def test_silu_mul_quant( 2025-05-07T20:31:49.5505840Z self, 2025-05-07T20:31:49.5505911Z T: int, 2025-05-07T20:31:49.5505981Z D: int, 2025-05-07T20:31:49.5506079Z scale_ub: Optional[float], 2025-05-07T20:31:49.5506162Z contiguous: bool, 2025-05-07T20:31:49.5506242Z compiled: bool, 2025-05-07T20:31:49.5506317Z ) -> None: 2025-05-07T20:31:49.5506409Z torch.manual_seed(2025) 2025-05-07T20:31:49.5506477Z 2025-05-07T20:31:49.5506642Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5508409Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5508422Z 2025-05-07T20:31:49.5508535Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5508540Z 2025-05-07T20:31:49.5508636Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5508858Z self=, 2025-05-07T20:31:49.5508931Z T=128, 2025-05-07T20:31:49.5509002Z D=7168, 2025-05-07T20:31:49.5509083Z scale_ub=1200.0, 2025-05-07T20:31:49.5509162Z contiguous=True, 2025-05-07T20:31:49.5509238Z compiled=True, 2025-05-07T20:31:49.5509309Z ) 2025-05-07T20:31:49.5509520Z self = 2025-05-07T20:31:49.5509690Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.5509694Z 2025-05-07T20:31:49.5509766Z @given( 2025-05-07T20:31:49.5509962Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5510059Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5510167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5510276Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5510387Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5510459Z ) 2025-05-07T20:31:49.5510832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5510930Z def test_silu_mul_quant( 2025-05-07T20:31:49.5511005Z self, 2025-05-07T20:31:49.5511080Z T: int, 2025-05-07T20:31:49.5511151Z D: int, 2025-05-07T20:31:49.5511245Z scale_ub: Optional[float], 2025-05-07T20:31:49.5511438Z contiguous: bool, 2025-05-07T20:31:49.5511519Z compiled: bool, 2025-05-07T20:31:49.5511591Z ) -> None: 2025-05-07T20:31:49.5511687Z torch.manual_seed(2025) 2025-05-07T20:31:49.5511756Z 2025-05-07T20:31:49.5511920Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5511991Z 2025-05-07T20:31:49.5512079Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5512200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5512286Z x = x_sign * x_clamp 2025-05-07T20:31:49.5512363Z x0 = x[:, :D] 2025-05-07T20:31:49.5512445Z x1 = x[:, D:] 2025-05-07T20:31:49.5512519Z 2025-05-07T20:31:49.5512595Z if contiguous: 2025-05-07T20:31:49.5512689Z x0 = x0.contiguous() 2025-05-07T20:31:49.5512774Z x1 = x1.contiguous() 2025-05-07T20:31:49.5512842Z 2025-05-07T20:31:49.5512935Z if scale_ub is not None: 2025-05-07T20:31:49.5513046Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5513178Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5513256Z ) 2025-05-07T20:31:49.5513330Z else: 2025-05-07T20:31:49.5513418Z scale_ub_tensor = None 2025-05-07T20:31:49.5513491Z 2025-05-07T20:31:49.5513617Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5513701Z op = silu_mul_quant 2025-05-07T20:31:49.5513785Z if compiled: 2025-05-07T20:31:49.5513880Z op = torch.compile(op) 2025-05-07T20:31:49.5513986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5514060Z 2025-05-07T20:31:49.5514146Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5514150Z 2025-05-07T20:31:49.5514249Z moe/activation_test.py:117: 2025-05-07T20:31:49.5514371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5514466Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5514566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5514927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5515014Z return fn(*args, **kwargs) 2025-05-07T20:31:49.5515507Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5515601Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5515956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5516179Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5516512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5516608Z kernel = self.compile( 2025-05-07T20:31:49.5516980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5517160Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5517282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5517286Z 2025-05-07T20:31:49.5517490Z self = 2025-05-07T20:31:49.5518267Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5518854Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa5b1e665e0>} 2025-05-07T20:31:49.5519596Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5519858Z context = 2025-05-07T20:31:49.5519863Z 2025-05-07T20:31:49.5520025Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5520287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5520389Z module_map=module_map) 2025-05-07T20:31:49.5520553Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5520648Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5520731Z E ^ 2025-05-07T20:31:49.5521084Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5521088Z 2025-05-07T20:31:49.5521498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5521508Z 2025-05-07T20:31:49.5521606Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5521824Z self=, 2025-05-07T20:31:49.5521897Z T=128, 2025-05-07T20:31:49.5521970Z D=7168, 2025-05-07T20:31:49.5522049Z scale_ub=1200.0, 2025-05-07T20:31:49.5522128Z contiguous=True, 2025-05-07T20:31:49.5522213Z compiled=False, 2025-05-07T20:31:49.5522282Z ) 2025-05-07T20:31:49.5522492Z self = 2025-05-07T20:31:49.5522664Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.5522668Z 2025-05-07T20:31:49.5522740Z @given( 2025-05-07T20:31:49.5522863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5522958Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5523094Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5523230Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5523351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5523420Z ) 2025-05-07T20:31:49.5523664Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5523753Z def test_silu_mul_quant( 2025-05-07T20:31:49.5523825Z self, 2025-05-07T20:31:49.5523900Z T: int, 2025-05-07T20:31:49.5523972Z D: int, 2025-05-07T20:31:49.5524068Z scale_ub: Optional[float], 2025-05-07T20:31:49.5524152Z contiguous: bool, 2025-05-07T20:31:49.5524230Z compiled: bool, 2025-05-07T20:31:49.5524309Z ) -> None: 2025-05-07T20:31:49.5524398Z torch.manual_seed(2025) 2025-05-07T20:31:49.5524467Z 2025-05-07T20:31:49.5524634Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5524705Z 2025-05-07T20:31:49.5524792Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5524920Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5526684Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5526690Z 2025-05-07T20:31:49.5526888Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.5526893Z 2025-05-07T20:31:49.5526992Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5527211Z self=, 2025-05-07T20:31:49.5527286Z T=128, 2025-05-07T20:31:49.5527430Z D=5120, 2025-05-07T20:31:49.5527513Z scale_ub=1200.0, 2025-05-07T20:31:49.5527591Z contiguous=True, 2025-05-07T20:31:49.5527667Z compiled=True, 2025-05-07T20:31:49.5527739Z ) 2025-05-07T20:31:49.5527956Z self = 2025-05-07T20:31:49.5528118Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.5528122Z 2025-05-07T20:31:49.5528198Z @given( 2025-05-07T20:31:49.5528311Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5528406Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5528527Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5528638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5528748Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5528818Z ) 2025-05-07T20:31:49.5529057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5529157Z def test_silu_mul_quant( 2025-05-07T20:31:49.5529231Z self, 2025-05-07T20:31:49.5529304Z T: int, 2025-05-07T20:31:49.5529375Z D: int, 2025-05-07T20:31:49.5529469Z scale_ub: Optional[float], 2025-05-07T20:31:49.5529550Z contiguous: bool, 2025-05-07T20:31:49.5529633Z compiled: bool, 2025-05-07T20:31:49.5529708Z ) -> None: 2025-05-07T20:31:49.5529799Z torch.manual_seed(2025) 2025-05-07T20:31:49.5529871Z 2025-05-07T20:31:49.5530032Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5530103Z 2025-05-07T20:31:49.5530194Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.5531953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5531968Z 2025-05-07T20:31:49.5532080Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.5532084Z 2025-05-07T20:31:49.5532182Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5532402Z self=, 2025-05-07T20:31:49.5532474Z T=128, 2025-05-07T20:31:49.5532546Z D=7168, 2025-05-07T20:31:49.5532631Z scale_ub=None, 2025-05-07T20:31:49.5532709Z contiguous=True, 2025-05-07T20:31:49.5532786Z compiled=True, 2025-05-07T20:31:49.5532857Z ) 2025-05-07T20:31:49.5533072Z self = 2025-05-07T20:31:49.5533251Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.5533262Z 2025-05-07T20:31:49.5533342Z @given( 2025-05-07T20:31:49.5533478Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5533577Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5533684Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5533797Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5533909Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5533979Z ) 2025-05-07T20:31:49.5534217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5534310Z def test_silu_mul_quant( 2025-05-07T20:31:49.5534485Z self, 2025-05-07T20:31:49.5534565Z T: int, 2025-05-07T20:31:49.5534634Z D: int, 2025-05-07T20:31:49.5534726Z scale_ub: Optional[float], 2025-05-07T20:31:49.5534816Z contiguous: bool, 2025-05-07T20:31:49.5534894Z compiled: bool, 2025-05-07T20:31:49.5535038Z ) -> None: 2025-05-07T20:31:49.5535130Z torch.manual_seed(2025) 2025-05-07T20:31:49.5535197Z 2025-05-07T20:31:49.5535357Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5537124Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5537130Z 2025-05-07T20:31:49.5537240Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5537373Z =============================== warnings summary =============================== 2025-05-07T20:31:49.5537684Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:49.5537980Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:49.5538266Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:49.5539139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:31:49.5539368Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:31:49.5539372Z 2025-05-07T20:31:49.5539547Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:49.5540815Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:49.5541003Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:49.5541007Z 2025-05-07T20:31:49.5541215Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:49.5541375Z ================== 1 failed, 1 passed, 13 warnings in 32.89s =================== 2025-05-07T20:31:51.2304189Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:51.2925677Z 2025-05-07T20:31:51.2926136Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:51.2926500Z 2025-05-07T20:31:51.2926509Z 2025-05-07T20:31:51.2948759Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:53.4611766Z ============================= test session starts ============================== 2025-05-07T20:31:53.4612786Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:53.4613700Z cachedir: .pytest_cache 2025-05-07T20:31:53.4615051Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:53.4616377Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:53.4617031Z plugins: hypothesis-6.131.14 2025-05-07T20:31:55.0572373Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:55.2703391Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:31:55.2704245Z run-last-failure: rerun previous 1 failure 2025-05-07T20:31:55.2704463Z 2025-05-07T20:31:57.4658497Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:57.4659583Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:57.4660980Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:57.4662422Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:57.4663821Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:57.4665257Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.4666572Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:57.4667968Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.4669399Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:57.4670720Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:57.4671942Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:57.4673157Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:57.4674199Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:57.4675242Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:57.4676497Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:57.4677782Z W0507 20:31:57.464447 87440 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:57.4679265Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:57.4680320Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:57.4681664Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:57.4683016Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:57.4684082Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.4685004Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.4685748Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:57.4686762Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.4829497Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:57.4830625Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:57.4831962Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:57.4833380Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:57.4834777Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:57.4836153Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.4837462Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:57.4838827Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.4840244Z W0507 20:31:57.482414 87440 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:57.4841492Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:57.4842706Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:57.4844083Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:57.4845172Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:57.4846305Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:57.4847519Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:57.4848788Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:57.4849903Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:57.4850939Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:57.4852112Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:57.4853469Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:57.4854523Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.4855466Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.4856236Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:57.4857247Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.1287004Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.1287726Z self=, 2025-05-07T20:31:58.1288137Z T=1, 2025-05-07T20:31:58.1288331Z D=5120, 2025-05-07T20:31:58.1288520Z scale_ub=None, 2025-05-07T20:31:58.1288734Z contiguous=True, 2025-05-07T20:31:58.1288961Z compiled=True, 2025-05-07T20:31:58.1289165Z ) 2025-05-07T20:31:58.1289484Z self = 2025-05-07T20:31:58.1290000Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:58.1290263Z 2025-05-07T20:31:58.1290344Z @given( 2025-05-07T20:31:58.1290569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.1290883Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.1291199Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.1291527Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.1291858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.1292144Z ) 2025-05-07T20:31:58.1292488Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.1292933Z def test_silu_mul_quant( 2025-05-07T20:31:58.1293177Z self, 2025-05-07T20:31:58.1293368Z T: int, 2025-05-07T20:31:58.1293563Z D: int, 2025-05-07T20:31:58.1293776Z scale_ub: Optional[float], 2025-05-07T20:31:58.1294045Z contiguous: bool, 2025-05-07T20:31:58.1294584Z compiled: bool, 2025-05-07T20:31:58.1294812Z ) -> None: 2025-05-07T20:31:58.1295031Z torch.manual_seed(2025) 2025-05-07T20:31:58.1295266Z 2025-05-07T20:31:58.1295560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.1295932Z 2025-05-07T20:31:58.1296267Z x_sign = torch.sign(x) 2025-05-07T20:31:58.1296561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.1296875Z x = x_sign * x_clamp 2025-05-07T20:31:58.1297111Z x0 = x[:, :D] 2025-05-07T20:31:58.1297326Z x1 = x[:, D:] 2025-05-07T20:31:58.1297530Z 2025-05-07T20:31:58.1297708Z if contiguous: 2025-05-07T20:31:58.1297938Z x0 = x0.contiguous() 2025-05-07T20:31:58.1298197Z x1 = x1.contiguous() 2025-05-07T20:31:58.1298430Z 2025-05-07T20:31:58.1298618Z if scale_ub is not None: 2025-05-07T20:31:58.1298890Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.1299233Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.1299540Z ) 2025-05-07T20:31:58.1299732Z else: 2025-05-07T20:31:58.1299945Z scale_ub_tensor = None 2025-05-07T20:31:58.1300191Z 2025-05-07T20:31:58.1300423Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.1300744Z op = silu_mul_quant 2025-05-07T20:31:58.1300989Z if compiled: 2025-05-07T20:31:58.1301236Z op = torch.compile(op) 2025-05-07T20:31:58.1301537Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.1301810Z 2025-05-07T20:31:58.1302004Z y_fp8, y_scale = fn() 2025-05-07T20:31:58.1302291Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:58.1302575Z 2025-05-07T20:31:58.1302820Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.1303157Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:58.1303447Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:58.1304080Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:58.1304443Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.1304752Z 2025-05-07T20:31:58.1304950Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:58.1305152Z 2025-05-07T20:31:58.1305254Z moe/activation_test.py:126: 2025-05-07T20:31:58.1305553Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.1305881Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:58.1306209Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.1307013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:58.1307783Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:58.1308331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.1309017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.1309698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:58.1310466Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.1311247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:58.1311997Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.1312728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:58.1313371Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:58.1313979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:58.1323502Z fn() 2025-05-07T20:31:58.1324066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:58.1324668Z self.fn.run( 2025-05-07T20:31:58.1325149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.1325844Z kernel = self.compile( 2025-05-07T20:31:58.1326392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.1327053Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.1327457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.1327684Z 2025-05-07T20:31:58.1327891Z self = 2025-05-07T20:31:58.1328999Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.1330400Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2ece7040>} 2025-05-07T20:31:58.1331758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.1332788Z context = 2025-05-07T20:31:58.1333076Z 2025-05-07T20:31:58.1333248Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.1333776Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.1334249Z module_map=module_map) 2025-05-07T20:31:58.1334613Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.1334973Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:58.1335243Z E ^ 2025-05-07T20:31:58.1335711Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.1336172Z 2025-05-07T20:31:58.1336588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.1337106Z 2025-05-07T20:31:58.1337209Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.1337626Z self=, 2025-05-07T20:31:58.1338022Z T=2048, 2025-05-07T20:31:58.1338210Z D=5120, 2025-05-07T20:31:58.1338406Z scale_ub=1200.0, 2025-05-07T20:31:58.1338620Z contiguous=True, 2025-05-07T20:31:58.1338847Z compiled=False, 2025-05-07T20:31:58.1339061Z ) 2025-05-07T20:31:59.1827905Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.1829207Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:59.1830622Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.1832067Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.1833702Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.1835094Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.1836529Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.1837901Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.1839317Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.1840569Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:59.1841785Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.1843003Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:59.1844046Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:59.1845065Z W0507 20:31:59.178220 87440 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:59.1846290Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.1847576Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.1848702Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:59.1849747Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:59.1850921Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.1852285Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.1853354Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.1854280Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.1855041Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:59.1856108Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.4149741Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.4150919Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:59.4152253Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.4153810Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.4155192Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.4156582Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.4157896Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.4159277Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.4160694Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.4161949Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:59.4163164Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.4164383Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:59.4165426Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:59.4166447Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:59.4167666Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.4168950Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.4170072Z W0507 
20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:59.4171122Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:59.4172299Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.4173736Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.4174805Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.4175796Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.4176542Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:59.4177568Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.2823592Z self = 2025-05-07T20:32:00.2824384Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.2824778Z 2025-05-07T20:32:00.2824901Z @given( 2025-05-07T20:32:00.2825212Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.2825652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.2826088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.2826539Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.2826966Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.2827253Z ) 2025-05-07T20:32:00.2827606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.2828052Z def test_silu_mul_quant( 2025-05-07T20:32:00.2828295Z self, 2025-05-07T20:32:00.2828488Z T: int, 2025-05-07T20:32:00.2828679Z D: int, 2025-05-07T20:32:00.2828899Z scale_ub: Optional[float], 2025-05-07T20:32:00.2829170Z contiguous: bool, 2025-05-07T20:32:00.2829406Z compiled: bool, 2025-05-07T20:32:00.2829635Z ) -> None: 2025-05-07T20:32:00.2829937Z torch.manual_seed(2025) 2025-05-07T20:32:00.2830181Z 2025-05-07T20:32:00.2830459Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.2830807Z 2025-05-07T20:32:00.2830995Z x_sign = torch.sign(x) 2025-05-07T20:32:00.2831298Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.2831610Z x = x_sign * x_clamp 2025-05-07T20:32:00.2831854Z x0 = x[:, :D] 2025-05-07T20:32:00.2832067Z x1 = x[:, D:] 2025-05-07T20:32:00.2832275Z 2025-05-07T20:32:00.2832463Z if contiguous: 2025-05-07T20:32:00.2832688Z x0 = x0.contiguous() 2025-05-07T20:32:00.2832952Z x1 = x1.contiguous() 2025-05-07T20:32:00.2833200Z 2025-05-07T20:32:00.2833393Z if scale_ub is not None: 2025-05-07T20:32:00.2833673Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.2834022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.2834329Z ) 2025-05-07T20:32:00.2834524Z else: 2025-05-07T20:32:00.2834734Z scale_ub_tensor = None 
2025-05-07T20:32:00.2834979Z 2025-05-07T20:32:00.2835213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.2835535Z op = silu_mul_quant 2025-05-07T20:32:00.2835782Z if compiled: 2025-05-07T20:32:00.2836030Z op = torch.compile(op) 2025-05-07T20:32:00.2836328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2836603Z 2025-05-07T20:32:00.2836787Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.2836957Z 2025-05-07T20:32:00.2837057Z moe/activation_test.py:117: 2025-05-07T20:32:00.2837352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2837682Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.2837966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2838848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.2839548Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.2840089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.2840896Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.2841560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.2842088Z kernel = self.compile( 2025-05-07T20:32:00.2842635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.2843291Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.2843684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2843918Z 2025-05-07T20:32:00.2844136Z self = 2025-05-07T20:32:00.2845238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.2846650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2f1a53a0>} 2025-05-07T20:32:00.2848010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.2849038Z context = 2025-05-07T20:32:00.2849336Z 2025-05-07T20:32:00.2849508Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.2850050Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.2850545Z module_map=module_map) 2025-05-07T20:32:00.2850909Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.2851267Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.2851528Z E ^ 2025-05-07T20:32:00.2851994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.2852463Z 2025-05-07T20:32:00.2852884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.2853405Z 2025-05-07T20:32:00.2853508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.2853927Z self=, 2025-05-07T20:32:00.2854329Z T=2048, 2025-05-07T20:32:00.2854518Z D=5120, 2025-05-07T20:32:00.2854716Z scale_ub=1200.0, 2025-05-07T20:32:00.2854932Z contiguous=True, 2025-05-07T20:32:00.2855154Z compiled=True, 2025-05-07T20:32:00.2855362Z ) 2025-05-07T20:32:00.2855678Z self = 2025-05-07T20:32:00.2856227Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:00.2856518Z 2025-05-07T20:32:00.2856593Z @given( 2025-05-07T20:32:00.2856823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.2857129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.2857437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.2857773Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.2858101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.2858390Z ) 2025-05-07T20:32:00.2858737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.2859258Z def test_silu_mul_quant( 2025-05-07T20:32:00.2859500Z self, 2025-05-07T20:32:00.2859696Z T: int, 2025-05-07T20:32:00.2859886Z D: int, 2025-05-07T20:32:00.2860103Z scale_ub: Optional[float], 2025-05-07T20:32:00.2860377Z contiguous: bool, 2025-05-07T20:32:00.2860614Z compiled: bool, 2025-05-07T20:32:00.2860909Z ) -> None: 2025-05-07T20:32:00.2861122Z torch.manual_seed(2025) 2025-05-07T20:32:00.2861366Z 2025-05-07T20:32:00.2861655Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.2861992Z 2025-05-07T20:32:00.2862181Z x_sign = torch.sign(x) 2025-05-07T20:32:00.2862471Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.2862777Z x = x_sign * x_clamp 2025-05-07T20:32:00.2863019Z x0 = x[:, :D] 2025-05-07T20:32:00.2863237Z x1 = x[:, D:] 2025-05-07T20:32:00.2863437Z 2025-05-07T20:32:00.2863620Z if contiguous: 2025-05-07T20:32:00.2863852Z x0 = x0.contiguous() 2025-05-07T20:32:00.2864106Z x1 = x1.contiguous() 2025-05-07T20:32:00.2864348Z 2025-05-07T20:32:00.2864542Z if scale_ub is not None: 2025-05-07T20:32:00.2864814Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.2865156Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.2865476Z ) 2025-05-07T20:32:00.2865662Z else: 2025-05-07T20:32:00.2865875Z scale_ub_tensor = None 2025-05-07T20:32:00.2866167Z 2025-05-07T20:32:00.2866415Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.2866732Z op = silu_mul_quant 2025-05-07T20:32:00.2866984Z if compiled: 2025-05-07T20:32:00.2867237Z op = torch.compile(op) 2025-05-07T20:32:00.2867534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2867814Z 2025-05-07T20:32:00.2868014Z y_fp8, y_scale = fn() 2025-05-07T20:32:00.2868303Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:00.2868596Z 2025-05-07T20:32:00.2868834Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.2869167Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:00.2869461Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:00.2869860Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:00.2870217Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.2870531Z 2025-05-07T20:32:00.2870728Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:00.2870923Z 2025-05-07T20:32:00.2871024Z moe/activation_test.py:126: 2025-05-07T20:32:00.2871315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2871649Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:00.2871980Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.2872783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:00.2873551Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:00.2874100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.2874791Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.2875479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:00.2876209Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.2876965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:00.2877722Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.2878535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:00.2879177Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:00.2879780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:00.2880363Z fn() 2025-05-07T20:32:00.2880874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:00.2881457Z self.fn.run( 2025-05-07T20:32:00.2881926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.2882454Z kernel = self.compile( 2025-05-07T20:32:00.2883000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.2883656Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.2884053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2884289Z 2025-05-07T20:32:00.2884497Z self = 2025-05-07T20:32:00.2885605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.2887062Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7feb2d89f670>} 2025-05-07T20:32:00.2888435Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.2889482Z context = 2025-05-07T20:32:00.2889775Z 2025-05-07T20:32:00.2889949Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.2890472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.2890945Z module_map=module_map) 2025-05-07T20:32:00.2891319Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.2891679Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:00.2891942Z E ^ 2025-05-07T20:32:00.2892413Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.2892869Z 2025-05-07T20:32:00.2893295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.2893812Z 2025-05-07T20:32:00.2893914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.2894339Z self=, 2025-05-07T20:32:00.2894749Z T=16384, 2025-05-07T20:32:00.2894948Z D=7168, 2025-05-07T20:32:00.2895143Z scale_ub=1200.0, 2025-05-07T20:32:00.2895369Z contiguous=False, 2025-05-07T20:32:00.2895604Z compiled=False, 2025-05-07T20:32:00.2895804Z ) 2025-05-07T20:32:00.9222665Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:00.9223858Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:00.9225803Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:00.9227532Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:00.9228915Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:00.9230514Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.9231822Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:00.9233202Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.9234610Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:00.9235871Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 
Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752 [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 [0/2] Traceback (most recent call last):
W0507 [0/2]   File ".../torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
W0507 [0/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
W0507 [0/2]   File ".../torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
W0507 [0/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
W0507 [0/2]   File ".../triton/compiler/compiler.py", line 100, in make_ir
W0507 [0/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
W0507 [0/2]   File ".../triton/compiler/code_generator.py", line 1298, in ast_to_ttir
W0507 [0/2]     generator.visit(fn.parse())
W0507 [0/2]   File ".../triton/compiler/code_generator.py", line 1201, in visit
W0507 [0/2]     ret = super().visit(node)
W0507 [0/2]   File ".../lib/python3.9/ast.py", line 407, in visit
W0507 [0/2]     return visitor(node)
W0507 [0/2]   File ".../triton/compiler/code_generator.py", line 352, in visit_Module
W0507 [0/2]     ast.NodeVisitor.generic_visit(self, node)
W0507 [0/2]   File ".../lib/python3.9/ast.py", line 415, in generic_visit
W0507 [0/2]     self.visit(item)
W0507 [0/2]   File ".../triton/compiler/code_generator.py", line 1207, in visit
W0507 [0/2]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 [0/2] triton.compiler.errors.CompilationError: at 1:0:
W0507 [0/2] def _fbgemm_silu_mul_quant(
W0507 [0/2] ^
W0507 [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
(The identical warning and traceback are emitted a second time at 20:32:01.095322.)

self = <ActivationTests instance; repr stripped in log capture>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
.../fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
.../triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
.../triton/runtime/jit.py:623: in run
    kernel = self.compile(
.../triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <ActivationTests instance; repr stripped in log capture>
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    (same @given/@settings decorators and test body as in the listing above;
    here fn() returns and the failure is raised in the reference path)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
.../fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    (autotuner and compiler frames as above)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
W0507 20:32:02.937809 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752 [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
(same traceback as the [0/2] warning above, again ending in CompilationError on _fbgemm_silu_mul_quant: "type fp8e4nv not supported in this architecture"; emitted twice, at 20:32:02.937809 and 20:32:03.617552)

self = <ActivationTests instance; repr stripped in log capture>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False

    (same test body as in the listing above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
.../fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <ActivationTests instance; repr stripped in log capture>
T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False

    (same test body as in the listing above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
.../fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError
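The error fires the moment Triton lowers any kernel that touches an fp8e4nv value, before the kernel ever runs. A minimal repro sketch, independent of FBGEMM (assumptions: Triton and a CUDA build of PyTorch are installed; on a pre-SM-8.9 GPU this should raise the same ValueError wrapped in CompilationError, and on newer GPUs it should simply run):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # This cast is what trips "type fp8e4nv not supported ..." at compile time.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8e4nv[(1,)](x, y, 1024, BLOCK=1024)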
Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <ActivationTests instance; repr stripped in log capture>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True

    (same test body as in the listing above; fn() returns and the failure is
    raised in the reference path)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
.../fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    (autotuner and compiler frames as above)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <ActivationTests instance; repr stripped in log capture>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

    (same test body as in the listing above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
.../fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError
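Per the error text, the only fp8 layouts Triton will lower on this device are fp8e4b15 and fp8e5. If these kernels are meant to run on such GPUs at all, the quantization dtype would have to be chosen per device; a sketch of that selection (illustrative only — whether the FBGEMM fp8 kernels accept e5m2 output is not established by this log):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Assumption: fp8e4nv (e4m3) needs SM >= 8.9; fp8e5 (e5m2) is the
        # variant Triton reports as supported on this older architecture.
        major, minor = torch.cuda.get_device_capability()
        return torch.float8_e4m3fn if (major, minor) >= (8, 9) else torch.float8_e5m2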
Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = <ActivationTests instance; repr stripped in log capture>
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    (same test body as in the listing above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
.../fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4436434Z 2025-05-07T20:32:05.4436852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4437365Z 2025-05-07T20:32:05.4437466Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4437880Z self=, 2025-05-07T20:32:05.4438286Z T=1, 2025-05-07T20:32:05.4438465Z D=5120, 2025-05-07T20:32:05.4438656Z scale_ub=None, 2025-05-07T20:32:05.4438870Z contiguous=True, 2025-05-07T20:32:05.4439094Z compiled=True, 2025-05-07T20:32:05.4439291Z ) 2025-05-07T20:32:05.9665384Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.9666578Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:05.9668807Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.9671776Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.9674831Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.9677284Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.9678593Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.9680114Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.9681530Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.9682787Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:05.9684008Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.9685234Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:05.9686276Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:05.9687289Z W0507 20:32:05.962504 87440 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:05.9688511Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.9689794Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.9690914Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:05.9691945Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:05.9693117Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.9694466Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.9695521Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.9696431Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.9697166Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:05.9698177Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.1544684Z [the identify_mutated_tensors warning and its CompilationError traceback are logged a second time here, identical to the block above]
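The ValueError pins down the root cause: Triton's fp8e4nv is the e4m3 format (torch.float8_e4m3fn), and Triton only lowers it on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The A10G on this runner reports capability 8.6, where Triton offers only fp8e4b15 and fp8e5, exactly as the message says. A minimal probe for gating these tests, assuming only PyTorch is available (gpu_supports_fp8e4nv is a hypothetical helper, not part of the FBGEMM test suite):

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Triton compiles fp8e4nv (torch.float8_e4m3fn) only for NVIDIA
        # GPUs with compute capability >= (8, 9), e.g. L4/L40 or H100.
        # Pre-Ada parts such as the A10G, which reports (8, 6), raise the
        # ValueError seen above instead.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

Wired into the test class as, say, @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+"), a guard like this would skip the example up front instead of failing inside the Triton compile.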
2025-05-07T20:32:06.6542790Z self = 2025-05-07T20:32:06.6543350Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:06.6543730Z 2025-05-07T20:32:06.6543851Z @given( 2025-05-07T20:32:06.6544170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:06.6544597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:06.6545015Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:06.6545446Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:06.6545871Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:06.6546158Z ) 2025-05-07T20:32:06.6546511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:06.6546953Z def test_silu_mul_quant( 2025-05-07T20:32:06.6547194Z self, 2025-05-07T20:32:06.6547394Z T: int, 2025-05-07T20:32:06.6547597Z D: int, 2025-05-07T20:32:06.6547816Z scale_ub: Optional[float], 2025-05-07T20:32:06.6548093Z contiguous: bool, 2025-05-07T20:32:06.6548334Z compiled: bool, 2025-05-07T20:32:06.6548562Z ) -> None: 2025-05-07T20:32:06.6548788Z torch.manual_seed(2025) 2025-05-07T20:32:06.6549042Z 2025-05-07T20:32:06.6549314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:06.6549658Z 2025-05-07T20:32:06.6549930Z x_sign = torch.sign(x) 2025-05-07T20:32:06.6550220Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:06.6550537Z x = x_sign * x_clamp 2025-05-07T20:32:06.6550781Z x0 = x[:, :D] 2025-05-07T20:32:06.6551003Z x1 = x[:, D:] 2025-05-07T20:32:06.6551210Z 2025-05-07T20:32:06.6551401Z if contiguous: 2025-05-07T20:32:06.6551638Z x0 = x0.contiguous() 2025-05-07T20:32:06.6551898Z x1 = x1.contiguous() 2025-05-07T20:32:06.6552144Z 2025-05-07T20:32:06.6552342Z if scale_ub is not None: 2025-05-07T20:32:06.6552620Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:06.6552963Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:06.6553277Z ) 2025-05-07T20:32:06.6553474Z else: 2025-05-07T20:32:06.6553690Z scale_ub_tensor = None
2025-05-07T20:32:06.6553947Z 2025-05-07T20:32:06.6554181Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:06.6554502Z op = silu_mul_quant 2025-05-07T20:32:06.6554759Z if compiled: 2025-05-07T20:32:06.6555012Z op = torch.compile(op) 2025-05-07T20:32:06.6555316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:06.6555602Z 2025-05-07T20:32:06.6555799Z y_fp8, y_scale = fn() 2025-05-07T20:32:06.6556085Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:06.6556382Z 2025-05-07T20:32:06.6556626Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:06.6556964Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:06.6557262Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:06.6557583Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:06.6558119Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:06.6558441Z 2025-05-07T20:32:06.6558647Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:06.6558845Z 2025-05-07T20:32:06.6558947Z moe/activation_test.py:126: 2025-05-07T20:32:06.6559254Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:06.6559713Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:06.6560045Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:06.6560833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:06.6561599Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:06.6562148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:06.6562835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:06.6563526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:06.6564253Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:06.6565006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:06.6565758Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:06.6566493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:06.6567136Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:06.6567739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:06.6568254Z fn() 2025-05-07T20:32:06.6568767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:06.6569353Z self.fn.run( 2025-05-07T20:32:06.6575489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:06.6576073Z kernel = self.compile( 2025-05-07T20:32:06.6576624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:06.6577284Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.6577699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:06.6577928Z 2025-05-07T20:32:06.6578142Z self = 2025-05-07T20:32:06.6579221Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:06.6580599Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2ca9c670>} 2025-05-07T20:32:06.6581950Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:06.6582976Z context = 2025-05-07T20:32:06.6583265Z 2025-05-07T20:32:06.6583434Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:06.6583946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.6584418Z module_map=module_map) 2025-05-07T20:32:06.6584784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.6585250Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:06.6585521Z E ^ 2025-05-07T20:32:06.6585996Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.6586445Z 2025-05-07T20:32:06.6586873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:06.6587510Z 2025-05-07T20:32:06.6587613Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:06.6588024Z self=, 2025-05-07T20:32:06.6588433Z T=2048, 2025-05-07T20:32:06.6588617Z D=5120, 2025-05-07T20:32:06.6588810Z scale_ub=None, 2025-05-07T20:32:06.6589024Z contiguous=True, 2025-05-07T20:32:06.6589242Z compiled=True, 2025-05-07T20:32:06.6589445Z ) 2025-05-07T20:32:07.1445670Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.1446939Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:07.1449251Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.1452102Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.1454851Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.1457409Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.1458713Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.1460090Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.1461513Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.1462774Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:07.1463993Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.1465215Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:07.1466245Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:07.1467281Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:07.1468756Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.1470117Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.1471238Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:07.1472400Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:07.1473581Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.1474939Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.1476006Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.1476927Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.1477711Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:07.1478745Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.3329886Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.3332022Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:07.3334684Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.3337363Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.3338798Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.3340184Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.3341492Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.3342873Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.3344285Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.3345536Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:07.3346939Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.3348203Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:07.3349365Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:07.3350442Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:07.3351664Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.3352948Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.3354067Z W0507 
20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:07.3355112Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:07.3356285Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.3357648Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.3358710Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.3359621Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.3360360Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:07.3361378Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.8368134Z self = 2025-05-07T20:32:07.8368873Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:07.8369237Z 2025-05-07T20:32:07.8369334Z @given( 2025-05-07T20:32:07.8369643Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.8370015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.8370321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.8370644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.8370978Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.8371256Z ) 2025-05-07T20:32:07.8371604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.8372049Z def test_silu_mul_quant( 2025-05-07T20:32:07.8372285Z self, 2025-05-07T20:32:07.8372477Z T: int, 2025-05-07T20:32:07.8372670Z D: int, 2025-05-07T20:32:07.8372880Z scale_ub: Optional[float], 2025-05-07T20:32:07.8373147Z contiguous: bool, 2025-05-07T20:32:07.8373380Z compiled: bool, 2025-05-07T20:32:07.8373596Z ) -> None: 2025-05-07T20:32:07.8373810Z torch.manual_seed(2025) 2025-05-07T20:32:07.8374214Z 2025-05-07T20:32:07.8374485Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.8374826Z 2025-05-07T20:32:07.8375015Z x_sign = torch.sign(x) 2025-05-07T20:32:07.8375304Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.8375608Z x = x_sign * x_clamp 2025-05-07T20:32:07.8375993Z x0 = x[:, :D] 2025-05-07T20:32:07.8376205Z x1 = x[:, D:] 2025-05-07T20:32:07.8376409Z 2025-05-07T20:32:07.8376592Z if contiguous: 2025-05-07T20:32:07.8376822Z x0 = x0.contiguous() 2025-05-07T20:32:07.8377073Z x1 = x1.contiguous() 2025-05-07T20:32:07.8377309Z 2025-05-07T20:32:07.8377498Z if scale_ub is not None: 2025-05-07T20:32:07.8377808Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.8378151Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.8378456Z ) 2025-05-07T20:32:07.8378644Z else: 2025-05-07T20:32:07.8378860Z scale_ub_tensor = None 
2025-05-07T20:32:07.8379107Z 2025-05-07T20:32:07.8379331Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.8379644Z op = silu_mul_quant 2025-05-07T20:32:07.8379888Z if compiled: 2025-05-07T20:32:07.8380129Z op = torch.compile(op) 2025-05-07T20:32:07.8380434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.8380709Z 2025-05-07T20:32:07.8380895Z y_fp8, y_scale = fn() 2025-05-07T20:32:07.8381176Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:07.8381463Z 2025-05-07T20:32:07.8381695Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.8382024Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:07.8382315Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:07.8382627Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:07.8382985Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.8383294Z 2025-05-07T20:32:07.8383491Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:07.8383685Z 2025-05-07T20:32:07.8383787Z moe/activation_test.py:126: 2025-05-07T20:32:07.8384077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.8384414Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:07.8384746Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.8385534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:07.8386284Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:07.8386830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.8387514Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.8388200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:07.8388924Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.8389677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:07.8390488Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.8391220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:07.8391862Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:07.8392471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:07.8392986Z fn() 2025-05-07T20:32:07.8393580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:07.8394164Z self.fn.run( 2025-05-07T20:32:07.8394630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.8395152Z kernel = self.compile( 2025-05-07T20:32:07.8395692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.8396422Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.8396813Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.8397046Z 2025-05-07T20:32:07.8397254Z self = 2025-05-07T20:32:07.8398395Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.8399793Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2c8dc9d0>} 2025-05-07T20:32:07.8401147Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.8402173Z context = 2025-05-07T20:32:07.8402469Z 2025-05-07T20:32:07.8402637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.8403153Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.8403614Z module_map=module_map) 2025-05-07T20:32:07.8404264Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.8404625Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:07.8404891Z E ^ 2025-05-07T20:32:07.8405342Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.8405801Z 2025-05-07T20:32:07.8406217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.8406744Z 2025-05-07T20:32:07.8406847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.8407254Z self=, 2025-05-07T20:32:07.8407674Z T=128, 2025-05-07T20:32:07.8407879Z D=5120, 2025-05-07T20:32:07.8408064Z scale_ub=None, 2025-05-07T20:32:07.8408270Z contiguous=True, 2025-05-07T20:32:07.8408489Z compiled=True, 2025-05-07T20:32:07.8408686Z ) 2025-05-07T20:32:08.3724012Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.3726105Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:08.3728149Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.3729567Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.3730936Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.3732474Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.3733779Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.3735245Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.3736652Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.3737941Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:08.3739143Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.3740341Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:08.3741380Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:08.3742389Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:08.3743595Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.3744871Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.3745982Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:08.3747023Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:08.3748239Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.3749578Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.3750687Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.3751592Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.3752327Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:08.3753334Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5656301Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.5657803Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:08.5660499Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.5663533Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.5666272Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.5668439Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.5669740Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.5671186Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.5672598Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.5673846Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:08.5675074Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.5676289Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:08.5677327Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:08.5678342Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:08.5679568Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.5680850Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.5681966Z W0507 
20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:08.5683011Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:08.5684181Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.5685540Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.5686705Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.5687625Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.5688440Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:08.5689455Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.3810484Z self = 2025-05-07T20:32:09.3811048Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:09.3811430Z 2025-05-07T20:32:09.3811543Z @given( 2025-05-07T20:32:09.3811852Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.3812257Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.3812593Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.3812921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.3813245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.3813539Z ) 2025-05-07T20:32:09.3813886Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.3814323Z def test_silu_mul_quant( 2025-05-07T20:32:09.3814570Z self, 2025-05-07T20:32:09.3814765Z T: int, 2025-05-07T20:32:09.3814965Z D: int, 2025-05-07T20:32:09.3815181Z scale_ub: Optional[float], 2025-05-07T20:32:09.3815450Z contiguous: bool, 2025-05-07T20:32:09.3815686Z compiled: bool, 2025-05-07T20:32:09.3815910Z ) -> None: 2025-05-07T20:32:09.3816124Z torch.manual_seed(2025) 2025-05-07T20:32:09.3816364Z 2025-05-07T20:32:09.3816634Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.3816972Z 2025-05-07T20:32:09.3817164Z x_sign = torch.sign(x) 2025-05-07T20:32:09.3817447Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.3817753Z x = x_sign * x_clamp 2025-05-07T20:32:09.3818024Z x0 = x[:, :D] 2025-05-07T20:32:09.3818255Z x1 = x[:, D:] 2025-05-07T20:32:09.3818459Z 2025-05-07T20:32:09.3818643Z if contiguous: 2025-05-07T20:32:09.3818864Z x0 = x0.contiguous() 2025-05-07T20:32:09.3819125Z x1 = x1.contiguous() 2025-05-07T20:32:09.3819360Z 2025-05-07T20:32:09.3819543Z if scale_ub is not None: 2025-05-07T20:32:09.3819817Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.3820147Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.3820453Z ) 2025-05-07T20:32:09.3820639Z else: 2025-05-07T20:32:09.3820852Z scale_ub_tensor = None 
2025-05-07T20:32:09.3821096Z 2025-05-07T20:32:09.3821322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3821632Z op = silu_mul_quant 2025-05-07T20:32:09.3821900Z if compiled: 2025-05-07T20:32:09.3822139Z op = torch.compile(op) 2025-05-07T20:32:09.3822438Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3822708Z 2025-05-07T20:32:09.3822901Z y_fp8, y_scale = fn() 2025-05-07T20:32:09.3823177Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:09.3823464Z 2025-05-07T20:32:09.3823695Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3824016Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:09.3824304Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:09.3824615Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:09.3824960Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.3825441Z 2025-05-07T20:32:09.3825648Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:09.3825841Z 2025-05-07T20:32:09.3825945Z moe/activation_test.py:126: 2025-05-07T20:32:09.3826233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3826563Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:09.3826999Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.3827773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:09.3828580Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:09.3829122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.3829858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.3830544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:09.3831261Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.3832007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:09.3832752Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.3833473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:09.3834108Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:09.3834700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:09.3835206Z fn() 2025-05-07T20:32:09.3835704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:09.3836281Z self.fn.run( 2025-05-07T20:32:09.3836738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.3837255Z kernel = self.compile( 2025-05-07T20:32:09.3837789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.3838437Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3838823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3839054Z 2025-05-07T20:32:09.3839260Z self = 2025-05-07T20:32:09.3840342Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.3841726Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2c87c940>} 2025-05-07T20:32:09.3843069Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.3844091Z context = 2025-05-07T20:32:09.3844384Z 2025-05-07T20:32:09.3844545Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.3845067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3845528Z module_map=module_map) 2025-05-07T20:32:09.3845885Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3846237Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:09.3846577Z E ^ 2025-05-07T20:32:09.3847034Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.3847484Z 2025-05-07T20:32:09.3847921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.3848529Z 2025-05-07T20:32:09.3848629Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.3849037Z self=, 2025-05-07T20:32:09.3849428Z T=4096, 2025-05-07T20:32:09.3849609Z D=5120, 2025-05-07T20:32:09.3849797Z scale_ub=None, 2025-05-07T20:32:09.3850002Z contiguous=True, 2025-05-07T20:32:09.3850215Z compiled=True, 2025-05-07T20:32:09.3850413Z ) 2025-05-07T20:32:09.9161405Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:09.9162475Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:09.9163808Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:09.9165226Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:09.9166598Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:09.9167983Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.9169279Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:09.9170640Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.9172048Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:09.9173288Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:09.9174496Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:09.9175698Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:09.9176730Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:09.9177822Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:09.9179527Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:09.9180931Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:09.9182035Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:09.9183181Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:09.9184347Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:09.9185694Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:09.9186744Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.9187688Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.9188616Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:09.9189863Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.1095317Z W0507 20:32:10.105468 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:10.1097403Z W0507 20:32:10.105468 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] [traceback identical to the warning above, again ending in triton.compiler.errors.CompilationError at 1:0: def _fbgemm_silu_mul_quant( with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:10.7836622Z self = <...>
2025-05-07T20:32:10.7837168Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None
        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7feb2c5e9700>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:10.7874849Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:10.8319579Z W0507 20:32:10.830472 87440 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:10.8320823Z W0507 20:32:10.830472 87440 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:10.8322149Z W0507 20:32:10.830472 87440 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:10.8323132Z W0507 20:32:10.830472 87440 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:10.8324228Z W0507 20:32:10.830472 87440 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
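##[group]Editor's note: root cause of the repeated CompilationError
Every failing example above dies at the same point: Triton refuses to lower the kernel to TTIR because the fp8e4nv dtype (float8 e4m3) is not available on this GPU. fp8e4nv requires an NVIDIA part with compute capability 8.9 or newer (Ada/Hopper); the linux.g5.4xlarge runner carries an A10G, which is SM 8.6, where Triton only exposes fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability guard that a test like this could use to skip cleanly on unsupported hardware (the helper name and class name are assumptions for illustration, not FBGEMM's actual code):

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Hypothetical helper: fp8e4nv (float8_e4m3fn) needs compute
        # capability >= (8, 9), i.e. Ada or Hopper; the A10G in this
        # job reports (8, 6), so Triton raises the ValueError above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantTests(unittest.TestCase):
        ...

The recompile_limit warning just above is a side effect of the same loop: each hypothesis example changes the layout of x0/x1 (the stride at index 0 flips between 5120 and 10240 as `contiguous` toggles), so torch.compile installs a new guard per variant until it hits config.recompile_limit (8) and stops recompiling that frame. If the recompiles themselves were the concern, one hedged option (the attribute name is taken from the warning text; 64 is an arbitrary example value) would be:

    import torch._dynamo

    # Lift the per-frame recompile cap that the warning mentions (default 8).
    torch._dynamo.config.recompile_limit = 64

or running with TORCH_LOGS="recompiles" to see every guard failure, as the warning itself suggests.
##[endgroup]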
2025-05-07T20:32:10.9563850Z self = <...>
2025-05-07T20:32:10.9564381Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[test source and traceback identical to the T = 4096 failure above: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]

[hypothesis kept generating examples; each reprinted the same test source and failed with the same CompilationError. Condensed:]
2025-05-07T20:32:10.9602825Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant: same error
2025-05-07T20:32:11.1378607Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row: same error
2025-05-07T20:32:11.2255583Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> fn() -> _fbgemm_silu_mul_quant: same error
2025-05-07T20:32:11.5991316Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> fn() -> _fbgemm_silu_mul_quant: same error
2025-05-07T20:32:11.6023030Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> fn() -> _fbgemm_silu_mul_quant: same error
2025-05-07T20:32:11.7604822Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fn() -> _fbgemm_silu_mul_quant: same error
2025-05-07T20:32:11.7635534Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fn() -> _fbgemm_silu_mul_quant, failing with:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[Both examples print the identical test body and traceback as the T=128 example above, except that with compiled=True the call first passes through torch/_dynamo/eval_frame.py:678 (_fn) before reaching fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant). Each fails compiling _fbgemm_silu_mul_quant with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
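For orientation: silu_mul_quant fuses y = x0 * sigmoid(x0) * x1 with row-wise FP8 quantization, which is why the generated Triton code needs the fp8e4nv (torch.float8_e4m3fn) type at all. Below is a minimal PyTorch sketch of the row-wise quantization step, assuming E4M3's finite max of 448 and treating scale_ub as an upper bound on the per-row max; this is an illustrative reference, not FBGEMM's triton_quantize_fp8_row implementation.

import torch
from typing import Optional, Tuple

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max magnitude; optionally clamp it so one outlier row
    # cannot blow up that row's shared scale.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    # E4M3 (fp8e4nv) has a finite max of 448; keep a tiny floor to avoid
    # dividing by zero on all-zero rows.
    scale = torch.clamp(row_max, min=1e-12) / 448.0
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

Dequantizing with y_fp8.to(torch.float32) * scale[:, None] matches how the test reconstructs y from fn()'s outputs.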
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[Same test body as above. Here the call to fn() succeeds, and the failure moves to the reference path instead:]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
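At this point both the kernel under test (_fbgemm_silu_mul_quant) and the reference kernel (_kernel_quantize_fp8_row) have failed on the same Triton check: fp8e4nv (torch.float8_e4m3fn) is only supported on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G on this linux.g5.4xlarge runner is SM 8.6. A minimal sketch of a capability gate for such tests follows, assuming one wants to skip rather than fail on older GPUs; the helper name supports_fp8e4nv is hypothetical, not part of the FBGEMM test suite.

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv maps to torch.float8_e4m3fn and is only lowered on
    # SM 8.9+ (Ada/Hopper); the A10G in this log is SM 8.6, hence the error.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Illustrative usage on the failing test:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...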
2025-05-07T20:32:12.6669653Z op = torch.compile(op) 2025-05-07T20:32:12.6670028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.6670300Z 2025-05-07T20:32:12.6670493Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.6670661Z 2025-05-07T20:32:12.6670762Z moe/activation_test.py:117: 2025-05-07T20:32:12.6671061Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.6671384Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.6671665Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.6672227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.6672780Z return fn(*args, **kwargs) 2025-05-07T20:32:12.6673438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.6674126Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.6674661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.6675336Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.6675997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.6676533Z kernel = self.compile( 2025-05-07T20:32:12.6677064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.6677710Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.6678101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.6678445Z 2025-05-07T20:32:12.6678654Z self = 2025-05-07T20:32:12.6679804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.6681176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2b23bf70>} 2025-05-07T20:32:12.6682554Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.6683578Z context = 2025-05-07T20:32:12.6683868Z 2025-05-07T20:32:12.6684033Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.6684546Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.6685013Z module_map=module_map) 2025-05-07T20:32:12.6685380Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.6685728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.6685995Z E ^ 2025-05-07T20:32:12.6686460Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.6686910Z 2025-05-07T20:32:12.6687326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.6687832Z 2025-05-07T20:32:12.6687933Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.6688340Z self=, 2025-05-07T20:32:12.6688769Z T=1, 2025-05-07T20:32:12.6688972Z D=5120, 2025-05-07T20:32:12.6689165Z scale_ub=1200.0, 2025-05-07T20:32:12.6689390Z contiguous=False, 2025-05-07T20:32:12.6689617Z compiled=False, 2025-05-07T20:32:12.6689818Z ) 2025-05-07T20:32:12.6690141Z self = 2025-05-07T20:32:12.6690626Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.6690894Z 2025-05-07T20:32:12.6690970Z @given( 2025-05-07T20:32:12.6691202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.6691512Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.6691825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.6692149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.6692479Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.6692767Z ) 2025-05-07T20:32:12.6693116Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.6693551Z def test_silu_mul_quant( 2025-05-07T20:32:12.6693793Z self, 2025-05-07T20:32:12.6693980Z T: int, 2025-05-07T20:32:12.6694177Z D: int, 2025-05-07T20:32:12.6694403Z scale_ub: Optional[float], 2025-05-07T20:32:12.6694670Z contiguous: bool, 2025-05-07T20:32:12.6694911Z compiled: bool, 2025-05-07T20:32:12.6695130Z ) -> None: 2025-05-07T20:32:12.6695342Z torch.manual_seed(2025) 2025-05-07T20:32:12.6695583Z 2025-05-07T20:32:12.6695862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.6696205Z 2025-05-07T20:32:12.6696392Z x_sign = torch.sign(x) 2025-05-07T20:32:12.6696682Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.6696991Z x = x_sign * x_clamp 2025-05-07T20:32:12.6697229Z x0 = x[:, :D] 2025-05-07T20:32:12.6697448Z x1 = x[:, D:] 2025-05-07T20:32:12.6697724Z 2025-05-07T20:32:12.6697906Z if contiguous: 2025-05-07T20:32:12.6698137Z x0 = x0.contiguous() 2025-05-07T20:32:12.6698396Z x1 = x1.contiguous() 2025-05-07T20:32:12.6698628Z 2025-05-07T20:32:12.6698894Z if scale_ub is not None: 2025-05-07T20:32:12.6699189Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.6699546Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.6699850Z ) 2025-05-07T20:32:12.6700041Z else: 2025-05-07T20:32:12.6700246Z scale_ub_tensor = None 2025-05-07T20:32:12.6700538Z 2025-05-07T20:32:12.6700766Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.6701077Z op = silu_mul_quant 2025-05-07T20:32:12.6701330Z if compiled: 2025-05-07T20:32:12.6701576Z op = torch.compile(op) 2025-05-07T20:32:12.6701874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.6702147Z 2025-05-07T20:32:12.6702342Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.6702506Z 2025-05-07T20:32:12.6702611Z moe/activation_test.py:117: 2025-05-07T20:32:12.6702896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.6703227Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.6703512Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.6704495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.6705186Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.6705726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.6706397Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.6707042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.6707569Z kernel = self.compile( 2025-05-07T20:32:12.6708099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.6708738Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.6709130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.6709358Z 2025-05-07T20:32:12.6709563Z self = 2025-05-07T20:32:12.6710696Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.6712068Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2ac483a0>} 2025-05-07T20:32:12.6713403Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.6714423Z context = 2025-05-07T20:32:12.6714710Z 2025-05-07T20:32:12.6714872Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.6715387Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.6715850Z module_map=module_map) 2025-05-07T20:32:12.6716209Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.6716561Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.6716815Z E ^ 2025-05-07T20:32:12.6717280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.6717833Z 2025-05-07T20:32:12.6718244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.6718803Z 2025-05-07T20:32:12.6718906Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.6719425Z self=, 2025-05-07T20:32:12.6719831Z T=16384, 2025-05-07T20:32:12.6720024Z D=5120, 2025-05-07T20:32:12.6720205Z scale_ub=1200.0, 2025-05-07T20:32:12.6720426Z contiguous=False, 2025-05-07T20:32:12.6720652Z compiled=True, 2025-05-07T20:32:12.6720911Z ) 2025-05-07T20:32:12.7903360Z self = 2025-05-07T20:32:12.7904014Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.7904298Z 2025-05-07T20:32:12.7904412Z @given( 2025-05-07T20:32:12.7904688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.7905126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.7905435Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.7905760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.7906084Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.7906375Z ) 2025-05-07T20:32:12.7906723Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.7907161Z def test_silu_mul_quant( 2025-05-07T20:32:12.7907397Z self, 2025-05-07T20:32:12.7907587Z T: int, 2025-05-07T20:32:12.7907776Z D: int, 2025-05-07T20:32:12.7907990Z scale_ub: Optional[float], 2025-05-07T20:32:12.7908255Z contiguous: bool, 2025-05-07T20:32:12.7908483Z compiled: bool, 2025-05-07T20:32:12.7908697Z ) -> None: 2025-05-07T20:32:12.7908909Z torch.manual_seed(2025) 2025-05-07T20:32:12.7909145Z 2025-05-07T20:32:12.7909403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.7909738Z 2025-05-07T20:32:12.7909997Z x_sign = torch.sign(x) 2025-05-07T20:32:12.7910275Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.7910576Z x = x_sign * x_clamp 2025-05-07T20:32:12.7910821Z x0 = x[:, :D] 2025-05-07T20:32:12.7911027Z x1 = x[:, D:] 2025-05-07T20:32:12.7911228Z 2025-05-07T20:32:12.7911403Z if contiguous: 2025-05-07T20:32:12.7911620Z x0 = x0.contiguous() 2025-05-07T20:32:12.7911880Z x1 = x1.contiguous() 2025-05-07T20:32:12.7912118Z 2025-05-07T20:32:12.7912311Z if scale_ub is not None: 2025-05-07T20:32:12.7912573Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.7912906Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.7913211Z ) 2025-05-07T20:32:12.7913390Z else: 2025-05-07T20:32:12.7913597Z scale_ub_tensor = None 2025-05-07T20:32:12.7913841Z 2025-05-07T20:32:12.7914067Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.7914377Z op = silu_mul_quant 2025-05-07T20:32:12.7914626Z if compiled: 2025-05-07T20:32:12.7914865Z op = torch.compile(op) 2025-05-07T20:32:12.7915160Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.7915432Z 2025-05-07T20:32:12.7915616Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.7915783Z 2025-05-07T20:32:12.7915880Z moe/activation_test.py:117: 2025-05-07T20:32:12.7916172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.7916501Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.7916770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.7917322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.7917877Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.7918538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.7919388Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.7919915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.7920736Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.7921388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.7921911Z kernel = self.compile( 2025-05-07T20:32:12.7922501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.7923141Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.7923528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.7923755Z 2025-05-07T20:32:12.7923960Z self = 2025-05-07T20:32:12.7925045Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.7926417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2abfa0d0>} 2025-05-07T20:32:12.7927749Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.7928771Z context = 2025-05-07T20:32:12.7929111Z 2025-05-07T20:32:12.7929273Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.7929788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.7930241Z module_map=module_map) 2025-05-07T20:32:12.7930607Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.7930954Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.7931200Z E ^ 2025-05-07T20:32:12.7931661Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.7932108Z 2025-05-07T20:32:12.7932521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.7933033Z 2025-05-07T20:32:12.7933136Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.7933536Z self=, 2025-05-07T20:32:12.7933940Z T=2048, 2025-05-07T20:32:12.7934118Z D=7168, 2025-05-07T20:32:12.7934304Z scale_ub=1200.0, 2025-05-07T20:32:12.7934521Z contiguous=False, 2025-05-07T20:32:12.7934737Z compiled=True, 2025-05-07T20:32:12.7934928Z ) 2025-05-07T20:32:12.7935243Z self = 2025-05-07T20:32:12.7935748Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.7936015Z 2025-05-07T20:32:12.7936091Z @given( 2025-05-07T20:32:12.7936309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.7936614Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.7936912Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.7937235Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.7937565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.7937844Z ) 2025-05-07T20:32:12.7938182Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.7938614Z def test_silu_mul_quant( 2025-05-07T20:32:12.7938911Z self, 2025-05-07T20:32:12.7939090Z T: int, 2025-05-07T20:32:12.7939280Z D: int, 2025-05-07T20:32:12.7939493Z scale_ub: Optional[float], 2025-05-07T20:32:12.7939765Z contiguous: bool, 2025-05-07T20:32:12.7940066Z compiled: bool, 2025-05-07T20:32:12.7940282Z ) -> None: 2025-05-07T20:32:12.7940489Z torch.manual_seed(2025) 2025-05-07T20:32:12.7940729Z 2025-05-07T20:32:12.7940997Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.7941355Z 2025-05-07T20:32:12.7941546Z x_sign = torch.sign(x) 2025-05-07T20:32:12.7941863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.7942166Z x = x_sign * x_clamp 2025-05-07T20:32:12.7942394Z x0 = x[:, :D] 2025-05-07T20:32:12.7942602Z x1 = x[:, D:] 2025-05-07T20:32:12.7942803Z 2025-05-07T20:32:12.7942976Z if contiguous: 2025-05-07T20:32:12.7943199Z x0 = x0.contiguous() 2025-05-07T20:32:12.7943456Z x1 = x1.contiguous() 2025-05-07T20:32:12.7943694Z 2025-05-07T20:32:12.7943880Z if scale_ub is not None: 2025-05-07T20:32:12.7944152Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.7944478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.7944777Z ) 2025-05-07T20:32:12.7944969Z else: 2025-05-07T20:32:12.7945170Z scale_ub_tensor = None 2025-05-07T20:32:12.7945412Z 2025-05-07T20:32:12.7945635Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.7945946Z op = silu_mul_quant 2025-05-07T20:32:12.7946182Z if compiled: 2025-05-07T20:32:12.7946423Z op = torch.compile(op) 2025-05-07T20:32:12.7946720Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.7946986Z 2025-05-07T20:32:12.7947181Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.7947340Z 2025-05-07T20:32:12.7947449Z moe/activation_test.py:117: 2025-05-07T20:32:12.7947740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.7948068Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.7948350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.7948945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.7949488Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.7950195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.7950877Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.7951398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.7952069Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.7952719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.7953243Z kernel = self.compile( 2025-05-07T20:32:12.7953769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.7954415Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.7954804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.7955027Z 2025-05-07T20:32:12.7955234Z self = 2025-05-07T20:32:12.7956312Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.7957684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2abfaca0>} 2025-05-07T20:32:12.7959201Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.7960222Z context = 2025-05-07T20:32:12.7960504Z 2025-05-07T20:32:12.7960667Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.7961181Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.7961682Z module_map=module_map) 2025-05-07T20:32:12.7962041Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.7962383Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.7962639Z E ^ 2025-05-07T20:32:12.7963099Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.7963551Z 2025-05-07T20:32:12.7963961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.7964471Z 2025-05-07T20:32:13.0657036Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.0657463Z self=, 2025-05-07T20:32:13.0657892Z T=1, 2025-05-07T20:32:13.0658085Z D=5120, 2025-05-07T20:32:13.0658279Z scale_ub=None, 2025-05-07T20:32:13.0658491Z contiguous=False, 2025-05-07T20:32:13.0658729Z compiled=False, 2025-05-07T20:32:13.0658939Z ) 2025-05-07T20:32:13.0659301Z self = 2025-05-07T20:32:13.0659785Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:13.0660046Z 2025-05-07T20:32:13.0660133Z @given( 2025-05-07T20:32:13.0660359Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.0660671Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.0660970Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.0661294Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.0661615Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.0661894Z ) 2025-05-07T20:32:13.0662232Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.0662662Z def test_silu_mul_quant( 2025-05-07T20:32:13.0662896Z self, 2025-05-07T20:32:13.0663086Z T: int, 2025-05-07T20:32:13.0663274Z D: int, 2025-05-07T20:32:13.0663490Z scale_ub: Optional[float], 2025-05-07T20:32:13.0663755Z contiguous: bool, 2025-05-07T20:32:13.0663983Z compiled: bool, 2025-05-07T20:32:13.0664200Z ) -> None: 2025-05-07T20:32:13.0664416Z torch.manual_seed(2025) 2025-05-07T20:32:13.0664648Z 2025-05-07T20:32:13.0664914Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.0665249Z 2025-05-07T20:32:13.0665438Z x_sign = torch.sign(x) 2025-05-07T20:32:13.0665723Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.0666031Z x = x_sign * x_clamp 2025-05-07T20:32:13.0666266Z x0 = x[:, :D] 2025-05-07T20:32:13.0666475Z x1 = x[:, D:] 2025-05-07T20:32:13.0666679Z 2025-05-07T20:32:13.0666862Z if contiguous: 2025-05-07T20:32:13.0667090Z x0 = x0.contiguous() 2025-05-07T20:32:13.0667342Z x1 = x1.contiguous() 2025-05-07T20:32:13.0667576Z 2025-05-07T20:32:13.0667760Z if scale_ub is not None: 2025-05-07T20:32:13.0668026Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.0668360Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.0668660Z ) 2025-05-07T20:32:13.0668854Z else: 2025-05-07T20:32:13.0669070Z scale_ub_tensor = None 2025-05-07T20:32:13.0669428Z 2025-05-07T20:32:13.0669655Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.0670035Z op = silu_mul_quant 2025-05-07T20:32:13.0670279Z if compiled: 2025-05-07T20:32:13.0670682Z op = torch.compile(op) 2025-05-07T20:32:13.0670985Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0671252Z 2025-05-07T20:32:13.0671433Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.0671598Z 2025-05-07T20:32:13.0671695Z moe/activation_test.py:117: 2025-05-07T20:32:13.0671984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0672364Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.0672642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0673326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.0674004Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.0674532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.0675215Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.0675870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.0676394Z kernel = self.compile( 2025-05-07T20:32:13.0676926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.0677577Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.0684060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0684304Z 2025-05-07T20:32:13.0684523Z self = 2025-05-07T20:32:13.0685623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.0687016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2b178670>} 2025-05-07T20:32:13.0688358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.0689381Z context = 2025-05-07T20:32:13.0689666Z 2025-05-07T20:32:13.0689835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.0690349Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.0690811Z module_map=module_map) 2025-05-07T20:32:13.0691171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.0691516Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.0691760Z E ^ 2025-05-07T20:32:13.0692229Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.0692681Z 2025-05-07T20:32:13.0693098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.0693609Z 2025-05-07T20:32:13.0693716Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.0694141Z self=, 2025-05-07T20:32:13.0694543Z T=4096, 2025-05-07T20:32:13.0694727Z D=7168, 2025-05-07T20:32:13.0694914Z scale_ub=1200.0, 2025-05-07T20:32:13.0695129Z contiguous=False, 2025-05-07T20:32:13.0695346Z compiled=False, 2025-05-07T20:32:13.0695629Z ) 2025-05-07T20:32:13.0695940Z self = 2025-05-07T20:32:13.0696430Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:13.0696703Z 2025-05-07T20:32:13.0696782Z @given( 2025-05-07T20:32:13.0697078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.0697385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.0697686Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.0698001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.0698373Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.0698654Z ) 2025-05-07T20:32:13.0698995Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.0699423Z def test_silu_mul_quant( 2025-05-07T20:32:13.0699654Z self, 2025-05-07T20:32:13.0699840Z T: int, 2025-05-07T20:32:13.0700023Z D: int, 2025-05-07T20:32:13.0700237Z scale_ub: Optional[float], 2025-05-07T20:32:13.0700502Z contiguous: bool, 2025-05-07T20:32:13.0700729Z compiled: bool, 2025-05-07T20:32:13.0700941Z ) -> None: 2025-05-07T20:32:13.0701151Z torch.manual_seed(2025) 2025-05-07T20:32:13.0701386Z 2025-05-07T20:32:13.0701648Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.0701980Z 2025-05-07T20:32:13.0702161Z x_sign = torch.sign(x) 2025-05-07T20:32:13.0702440Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.0702739Z x = x_sign * x_clamp 2025-05-07T20:32:13.0702977Z x0 = x[:, :D] 2025-05-07T20:32:13.0703187Z x1 = x[:, D:] 2025-05-07T20:32:13.0703385Z 2025-05-07T20:32:13.0703568Z if contiguous: 2025-05-07T20:32:13.0704055Z x0 = x0.contiguous() 2025-05-07T20:32:13.0704308Z x1 = x1.contiguous() 2025-05-07T20:32:13.0704537Z 2025-05-07T20:32:13.0704718Z if scale_ub is not None: 2025-05-07T20:32:13.0704979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.0705308Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.0705604Z ) 2025-05-07T20:32:13.0705792Z else: 2025-05-07T20:32:13.0706002Z scale_ub_tensor = None 2025-05-07T20:32:13.0706246Z 2025-05-07T20:32:13.0706469Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.0706775Z op = silu_mul_quant 2025-05-07T20:32:13.0707018Z if compiled: 2025-05-07T20:32:13.0707253Z op = torch.compile(op) 2025-05-07T20:32:13.0707547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0707809Z 2025-05-07T20:32:13.0707988Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.0708153Z 2025-05-07T20:32:13.0708249Z moe/activation_test.py:117: 2025-05-07T20:32:13.0708537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0708884Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.0709189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0709923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.0710615Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
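This job ran on a linux.g5.4xlarge runner, i.e. an NVIDIA A10G at compute capability (SM) 8.6. Triton's NVIDIA backend only compiles the fp8e4nv (e4m3) dtype on SM 8.9 and newer (Ada/Hopper), which is why every drawn example dies in make_ir before the kernel ever runs. Below is a minimal sketch of a capability guard that would skip these cases on older GPUs; supports_fp8e4nv and the decorated class are illustrative assumptions, not FBGEMM code:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) needs an SM 8.9+ GPU; the A10G here reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on a test class like the failing one:
@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class Fp8ActivationTests(unittest.TestCase):
    ...

Skipping at the class level keeps Hypothesis from drawing any examples at all, so the log would show a single skip instead of one CompilationError per example.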
Hypothesis then tried ten more parameter combinations, each failing with the same CompilationError at the same call site (moe/activation_test.py:117 via activation.py:80):

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
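Every "Trying example:" record above is Hypothesis echoing one drawn parameter combination: the test is parametrized with @given over sampled_from strategies, and settings(verbosity=Verbosity.verbose) makes the engine print each example before running it. A self-contained toy (no GPU needed; test_toy and its strategy values are illustrative) that reproduces this logging pattern:

from hypothesis import Verbosity, given, settings
from hypothesis import strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
def test_toy(T: int, D: int) -> None:
    # The assertion is trivial; the point is the "Trying example: ..." output.
    assert T * D > 0


if __name__ == "__main__":
    test_toy()  # Hypothesis prints one "Trying example" line per draw

Running this prints five "Trying example: test_toy(T=..., D=...)" lines, mirroring the records in this log.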
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.5837108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.5837772Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.8130896Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True): same CompilationError
2025-05-07T20:32:14.8163221Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True): same CompilationError
2025-05-07T20:32:14.9405116Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError
2025-05-07T20:32:15.1689811Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError
2025-05-07T20:32:15.1722503Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError
2025-05-07T20:32:15.4965487Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError
2025-05-07T20:32:15.7773207Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): same CompilationError
2025-05-07T20:32:15.7804438Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError
2025-05-07T20:32:15.7835541Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError
2025-05-07T20:32:15.9073255Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.9103937Z 2025-05-07T20:32:15.9104468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.9104982Z 2025-05-07T20:32:16.0854215Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.0856599Z self=, 2025-05-07T20:32:16.0857858Z T=2048, 2025-05-07T20:32:16.0858264Z D=7168, 2025-05-07T20:32:16.0858636Z scale_ub=None, 2025-05-07T20:32:16.0859071Z contiguous=True, 2025-05-07T20:32:16.0859486Z compiled=True, 2025-05-07T20:32:16.0859702Z ) 2025-05-07T20:32:16.0860132Z self = 2025-05-07T20:32:16.0860642Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.0860917Z 2025-05-07T20:32:16.0861004Z @given( 2025-05-07T20:32:16.0861238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.0861564Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.0861892Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.0862227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.0862563Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.0862858Z ) 2025-05-07T20:32:16.0863227Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.0863674Z def test_silu_mul_quant( 2025-05-07T20:32:16.0863925Z self, 2025-05-07T20:32:16.0864131Z T: int, 2025-05-07T20:32:16.0864327Z D: int, 2025-05-07T20:32:16.0864558Z scale_ub: Optional[float], 2025-05-07T20:32:16.0864844Z contiguous: bool, 2025-05-07T20:32:16.0865085Z compiled: bool, 2025-05-07T20:32:16.0865321Z ) -> None: 2025-05-07T20:32:16.0865544Z torch.manual_seed(2025) 2025-05-07T20:32:16.0865787Z 2025-05-07T20:32:16.0866071Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.0866424Z 2025-05-07T20:32:16.0866619Z x_sign = torch.sign(x) 2025-05-07T20:32:16.0866922Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.0867247Z x = x_sign * x_clamp 2025-05-07T20:32:16.0867491Z x0 = x[:, :D] 2025-05-07T20:32:16.0867724Z x1 = x[:, D:] 2025-05-07T20:32:16.0867948Z 2025-05-07T20:32:16.0868135Z if contiguous: 2025-05-07T20:32:16.0868382Z x0 = x0.contiguous() 2025-05-07T20:32:16.0868656Z x1 = x1.contiguous() 2025-05-07T20:32:16.0868916Z 2025-05-07T20:32:16.0869110Z if scale_ub is not None: 2025-05-07T20:32:16.0869401Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.0869753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.0870197Z ) 2025-05-07T20:32:16.0870403Z else: 2025-05-07T20:32:16.0870625Z scale_ub_tensor = None 2025-05-07T20:32:16.0870880Z 2025-05-07T20:32:16.0871124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.0871451Z op = silu_mul_quant 2025-05-07T20:32:16.0871703Z if compiled: 2025-05-07T20:32:16.0871977Z op = torch.compile(op) 2025-05-07T20:32:16.0872288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.0872570Z 2025-05-07T20:32:16.0872772Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.0872940Z 2025-05-07T20:32:16.0873054Z moe/activation_test.py:117: 2025-05-07T20:32:16.0873361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.0873695Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.0873988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.0874557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.0875117Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.0875785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.0876580Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.0877125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.0877891Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.0878565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.0879113Z kernel = self.compile( 2025-05-07T20:32:16.0879658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.0880420Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.0880831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.0881063Z 2025-05-07T20:32:16.0881281Z self = 2025-05-07T20:32:16.0882378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.0883917Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2a5314c0>} 2025-05-07T20:32:16.0885415Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.0886464Z context = 2025-05-07T20:32:16.0886753Z 2025-05-07T20:32:16.0886934Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.0887457Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.0887943Z module_map=module_map) 2025-05-07T20:32:16.0888322Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.0888687Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.0888957Z E ^ 2025-05-07T20:32:16.0889436Z E ValueError("type fp8e4nv not supported in this architecture. 
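Every CompilationError in this run has the same root cause: the Triton kernel requests the fp8e4nv (torch.float8_e4m3fn) element type, which Triton lowers only on GPUs of compute capability 8.9 or newer (Ada/Hopper class). This job's linux.g5.4xlarge runner carries an A10G, which reports sm_86, so only the fp8e4b15 and fp8e5 encodings are available there. A minimal capability guard along these lines (a sketch, not code from FBGEMM or this workflow) would let such tests skip cleanly on unsupported hardware:

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (torch.float8_e4m3fn) requires sm_89 or newer;
    # the A10G on this runner reports (8, 6), hence the ValueError above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on a test such as test_silu_mul_quant:
# @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv not supported on this GPU")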
2025-05-07T20:32:16.0890962Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:16.0900355Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:16.0902405Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.0904929Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:16.0905259Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:16.0914200Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:16.0916215Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.0918316Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:16.0918641Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:16.1984053Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:16.1986161Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.1988225Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:16.1988562Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:16.1997917Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:16.2000051Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.2002085Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:16.2002412Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:16.2011465Z >       x_sign = torch.sign(x)
2025-05-07T20:32:16.2013445Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.2015467Z moe/activation_test.py:94: OutOfMemoryError
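The requested allocation sizes match the test's bfloat16 input exactly: x = torch.randn([T, 2 * D]) occupies T * 2D * 2 bytes, and each of torch.sign, torch.abs, and torch.clamp materializes one more buffer of the same shape, which is why the failures at activation_test.py:92, :94, and :95 all request the full tensor size. A quick sanity check of the numbers (illustrative arithmetic only, not part of the test suite):

# Size in MiB of a [T, 2*D] bfloat16 tensor (2 bytes per element).
def tensor_mib(T: int, D: int) -> float:
    return T * (2 * D) * 2 / 2**20

print(tensor_mib(16384, 5120))  # 320.0 -> matches "Tried to allocate 320.00 MiB"
print(tensor_mib(16384, 7168))  # 448.0 -> matches the 448.00 MiB requests
print(tensor_mib(2048, 7168))   # 56.0  -> matches the 56.00 MiB requests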
2025-05-07T20:32:16.2015879Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:16.3639933Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.3641976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:16.3642613Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:16.3671046Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.3673626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:16.3674278Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:16.4607452Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.4609402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
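Because the test runs with verbosity=Verbosity.verbose, every generated example is printed, so each failing parameter combination above can be replayed deterministically. One option during local debugging is Hypothesis's @example decorator, which forces a case to run in addition to the generated ones; a sketch with the decorator values taken from the log above (test body elided, name hypothetical):

from hypothesis import Verbosity, example, given, settings
from hypothesis import strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
@settings(verbosity=Verbosity.verbose, deadline=None)
def test_silu_mul_quant_repro(T, D, scale_ub, contiguous, compiled):
    ...  # same body as test_silu_mul_quant in moe/activation_test.py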
2025-05-07T20:32:16.4610024Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:16.4618290Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:16.4620415Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.4622395Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:16.4622713Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:16.5145990Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.5147935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
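For orientation on what the failing call computes: the test passes two [T, D] bfloat16 halves plus an optional scale upper bound and unpacks (y_fp8, y_scale). Judging purely from the names and signature in the test, not from FBGEMM source, silu_mul_quant fuses y = silu(x0) * x1 with rowwise FP8 (e4m3) quantization; a rough eager stand-in might look like:

import torch
import torch.nn.functional as F

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_reference(x0, x1, scale_ub=None):
    # Hypothetical eager equivalent of the fused Triton kernel under test.
    y = F.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # scale_ub: 1-element tensor
    scale = row_max / FP8_MAX
    y_fp8 = (y / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(1)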
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5147518Z 2025-05-07T20:32:16.5147935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5148449Z 2025-05-07T20:32:16.5148549Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5148963Z self=, 2025-05-07T20:32:16.5149416Z T=2048, 2025-05-07T20:32:16.5149596Z D=5120, 2025-05-07T20:32:16.5149785Z scale_ub=None, 2025-05-07T20:32:16.5150050Z contiguous=True, 2025-05-07T20:32:16.5150270Z compiled=False, 2025-05-07T20:32:16.5150476Z ) 2025-05-07T20:32:16.5150912Z self = 2025-05-07T20:32:16.5151401Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.5151686Z 2025-05-07T20:32:16.5151760Z @given( 2025-05-07T20:32:16.5151991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5152334Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5152637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5152967Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5153296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5153571Z ) 2025-05-07T20:32:16.5153916Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5154362Z def test_silu_mul_quant( 2025-05-07T20:32:16.5154595Z self, 2025-05-07T20:32:16.5154786Z T: int, 2025-05-07T20:32:16.5154980Z D: int, 2025-05-07T20:32:16.5155193Z scale_ub: Optional[float], 2025-05-07T20:32:16.5155458Z contiguous: bool, 2025-05-07T20:32:16.5155692Z compiled: bool, 2025-05-07T20:32:16.5155901Z ) -> None: 2025-05-07T20:32:16.5156115Z torch.manual_seed(2025) 2025-05-07T20:32:16.5156351Z 2025-05-07T20:32:16.5156613Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5156951Z 2025-05-07T20:32:16.5157138Z > x_sign = torch.sign(x) 2025-05-07T20:32:16.5159088Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5160957Z 2025-05-07T20:32:16.5161072Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.5161279Z 2025-05-07T20:32:16.5161393Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5161799Z self=, 2025-05-07T20:32:16.5162202Z T=16384, 2025-05-07T20:32:16.5162389Z D=5120, 2025-05-07T20:32:16.5162570Z scale_ub=None, 2025-05-07T20:32:16.5162793Z contiguous=True, 2025-05-07T20:32:16.5163016Z compiled=False, 2025-05-07T20:32:16.5163215Z ) 2025-05-07T20:32:16.5163575Z self = 2025-05-07T20:32:16.5164173Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.5164476Z 2025-05-07T20:32:16.5164619Z @given( 2025-05-07T20:32:16.5165033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5171678Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5172009Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5172348Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5172676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5172974Z ) 2025-05-07T20:32:16.5173335Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5173782Z def test_silu_mul_quant( 2025-05-07T20:32:16.5174029Z self, 2025-05-07T20:32:16.5174227Z T: int, 2025-05-07T20:32:16.5174416Z D: int, 2025-05-07T20:32:16.5174642Z scale_ub: Optional[float], 2025-05-07T20:32:16.5174921Z contiguous: bool, 2025-05-07T20:32:16.5175244Z compiled: bool, 2025-05-07T20:32:16.5175464Z ) -> None: 2025-05-07T20:32:16.5175690Z torch.manual_seed(2025) 2025-05-07T20:32:16.5175937Z 2025-05-07T20:32:16.5176206Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5178398Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5180356Z 2025-05-07T20:32:16.5180480Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.5180699Z 2025-05-07T20:32:16.5180807Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5181224Z self=, 2025-05-07T20:32:16.5181633Z T=4096, 2025-05-07T20:32:16.5181818Z D=5120, 2025-05-07T20:32:16.5182015Z scale_ub=None, 2025-05-07T20:32:16.5182221Z contiguous=True, 2025-05-07T20:32:16.5182449Z compiled=False, 2025-05-07T20:32:16.5182656Z ) 2025-05-07T20:32:16.6236725Z self = 2025-05-07T20:32:16.6237798Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.6238347Z 2025-05-07T20:32:16.6238491Z @given( 2025-05-07T20:32:16.6238942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6239431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6239730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6240062Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6240397Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6240677Z ) 2025-05-07T20:32:16.6241025Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6241469Z def test_silu_mul_quant( 2025-05-07T20:32:16.6241717Z self, 2025-05-07T20:32:16.6241913Z T: int, 2025-05-07T20:32:16.6242116Z D: int, 2025-05-07T20:32:16.6242335Z scale_ub: Optional[float], 2025-05-07T20:32:16.6242599Z contiguous: bool, 2025-05-07T20:32:16.6242837Z compiled: bool, 2025-05-07T20:32:16.6243063Z ) -> None: 2025-05-07T20:32:16.6243273Z torch.manual_seed(2025) 2025-05-07T20:32:16.6243518Z 2025-05-07T20:32:16.6243794Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6245877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.6247778Z 2025-05-07T20:32:16.6247895Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.6248119Z 2025-05-07T20:32:16.6248220Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6248642Z self=, 2025-05-07T20:32:16.6249046Z T=2048, 2025-05-07T20:32:16.6249232Z D=5120, 2025-05-07T20:32:16.6249413Z scale_ub=None, 2025-05-07T20:32:16.6249631Z contiguous=False, 2025-05-07T20:32:16.6249860Z compiled=False, 2025-05-07T20:32:16.6250058Z ) 2025-05-07T20:32:16.6250523Z self = 2025-05-07T20:32:16.6251096Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.6251416Z 2025-05-07T20:32:16.6251496Z @given( 2025-05-07T20:32:16.6251857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6252211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6252552Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6252920Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6253293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6253673Z ) 2025-05-07T20:32:16.6254066Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6254586Z def test_silu_mul_quant( 2025-05-07T20:32:16.6254852Z self, 2025-05-07T20:32:16.6255052Z T: int, 2025-05-07T20:32:16.6255266Z D: int, 2025-05-07T20:32:16.6255501Z scale_ub: Optional[float], 2025-05-07T20:32:16.6255794Z contiguous: bool, 2025-05-07T20:32:16.6256049Z compiled: bool, 2025-05-07T20:32:16.6256286Z ) -> None: 2025-05-07T20:32:16.6256508Z torch.manual_seed(2025) 2025-05-07T20:32:16.6256769Z 2025-05-07T20:32:16.6257067Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6259675Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.6262069Z 2025-05-07T20:32:16.6262202Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.6262444Z 2025-05-07T20:32:16.6262552Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6263026Z self=, 2025-05-07T20:32:16.6263492Z T=4096, 2025-05-07T20:32:16.6263685Z D=7168, 2025-05-07T20:32:16.6263885Z scale_ub=None, 2025-05-07T20:32:16.6264117Z contiguous=True, 2025-05-07T20:32:16.6264348Z compiled=True, 2025-05-07T20:32:16.6264564Z ) 2025-05-07T20:32:16.6264920Z self = 2025-05-07T20:32:16.6265489Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.6265799Z 2025-05-07T20:32:16.6265876Z @given( 2025-05-07T20:32:16.6266120Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6266467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6266800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6267173Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6267541Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6267859Z ) 2025-05-07T20:32:16.6268268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6268784Z def test_silu_mul_quant( 2025-05-07T20:32:16.6269049Z self, 2025-05-07T20:32:16.6269252Z T: int, 2025-05-07T20:32:16.6269461Z D: int, 2025-05-07T20:32:16.6269697Z scale_ub: Optional[float], 2025-05-07T20:32:16.6270037Z contiguous: bool, 2025-05-07T20:32:16.6270276Z compiled: bool, 2025-05-07T20:32:16.6270495Z ) -> None: 2025-05-07T20:32:16.6270703Z torch.manual_seed(2025) 2025-05-07T20:32:16.6270942Z 2025-05-07T20:32:16.6271215Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6273403Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.6275470Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:16.6275817Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 40.00 MiB)
2025-05-07T20:32:16.6288056Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 112.00 MiB)
2025-05-07T20:32:16.6300668Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 448.00 MiB)
2025-05-07T20:32:16.9549371Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 112.00 MiB)
2025-05-07T20:32:16.9561719Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 448.00 MiB)
2025-05-07T20:32:16.9573941Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.9585632Z moe/activation_test.py:92: OutOfMemoryError
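Every failure above is the same allocator pattern: roughly 22 GiB is already held on the 22.07 GiB device before the example starts, so even small requests (20 to 448 MiB) fail at the first allocation. The OOM text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; the following is a minimal sketch of applying it, not part of the log, assuming the tests are launched from a Python entry point:

```python
# Sketch only (not from the log): the allocator hint suggested by the OOM
# message above. PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching
# allocator initializes, so it must be set before the first CUDA allocation.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the variable so the allocator sees it

# The largest shape Hypothesis tried above: 16384 x (2 * 7168) in bfloat16.
x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)

# Note: expandable segments only mitigates fragmentation; it cannot reclaim
# the ~21.7 GiB the log shows still allocated from earlier examples. Freeing
# tensors and calling torch.cuda.empty_cache() between examples addresses that.
```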
2025-05-07T20:32:16.9585938Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:17.1271149Z > y_fp8, y_scale = fn()
2025-05-07T20:32:17.1271413Z moe/activation_test.py:117:
2025-05-07T20:32:17.1271704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.1272026Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.1272301Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.1272984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.1273673Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.1274205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:17.1274880Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:17.1275526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:17.1276172Z     kernel = self.compile(
2025-05-07T20:32:17.1276705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.1283689Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.1289896Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:17.1290413Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:32:17.1291244Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.1291599Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.1291860Z E   ^
2025-05-07T20:32:17.1292310Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.1293180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.1293796Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:17.1301593Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:17.1303966Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:17.1306004Z moe/activation_test.py:92: OutOfMemoryError
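The CompilationError above is architectural rather than flaky: Triton's fp8e4nv (e4m3) type requires compute capability 8.9 or newer (Ada/Hopper), while the A10G behind this linux.g5.4xlarge runner is sm_86 and only exposes fp8e4b15 and fp8e5, exactly as the ValueError states. Below is a hedged sketch of the kind of capability guard that would skip these cases on unsupported GPUs; the helper name and its placement are illustrative, not the repository's actual code:

```python
# Sketch only: skip fp8 test cases on GPUs without fp8e4nv support.
import unittest

import torch

def _supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (e4m3) lowering needs compute capability >= 8.9
    # (Ada/Hopper); the A10G on this runner reports sm_86.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class ActivationTests(unittest.TestCase):
    ...  # test_silu_mul_quant as in the listing above
```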
2025-05-07T20:32:17.1306317Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant via fn() at moe/activation_test.py:117, entering through torch/_dynamo/eval_frame.py:678 (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:17.1792105Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp allocation; tried to allocate 20.00 MiB)
2025-05-07T20:32:17.1805257Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 20.00 MiB)
2025-05-07T20:32:17.1818217Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.4308002Z 2025-05-07T20:32:17.4308128Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.4308336Z 2025-05-07T20:32:17.4339375Z FAILED 2025-05-07T20:32:17.4339619Z 2025-05-07T20:32:17.4339805Z =================================== FAILURES =================================== 2025-05-07T20:32:17.4340256Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:17.4340791Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:17.4341655Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:17.4342398Z | yield 2025-05-07T20:32:17.4342970Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:32:17.4343673Z | self._callTestMethod(testMethod) 2025-05-07T20:32:17.4344431Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:32:17.4345327Z | method() 2025-05-07T20:32:17.4346340Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:17.4347373Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4348244Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:17.4349086Z | raise the_error_hypothesis_found 2025-05-07T20:32:17.4350189Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:17.4350846Z +-+---------------- 1 ---------------- 2025-05-07T20:32:17.4351232Z | Traceback (most recent call last): 2025-05-07T20:32:17.4352195Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:17.4353253Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4356085Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.4358799Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.4359388Z | self=, 2025-05-07T20:32:17.4359948Z | T=2048, 2025-05-07T20:32:17.4360271Z | D=5120, # or any other generated value 2025-05-07T20:32:17.4360725Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:17.4361200Z | contiguous=True, # or any other generated value 2025-05-07T20:32:17.4361699Z | compiled=False, # or any other generated value 2025-05-07T20:32:17.4362119Z | ) 2025-05-07T20:32:17.4362346Z | 2025-05-07T20:32:17.4363047Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:17.4363872Z +---------------- 2 ---------------- 2025-05-07T20:32:17.4364259Z | Traceback (most recent call last): 2025-05-07T20:32:17.4365219Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:17.4366291Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4369158Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.4371891Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.4372491Z | self=, 2025-05-07T20:32:17.4373029Z | T=128, 2025-05-07T20:32:17.4373301Z | D=7168, 2025-05-07T20:32:17.4373522Z | scale_ub=None, 2025-05-07T20:32:17.4373794Z | contiguous=True, 2025-05-07T20:32:17.4374141Z | compiled=True, 2025-05-07T20:32:17.4374526Z | ) 2025-05-07T20:32:17.4374753Z | 2025-05-07T20:32:17.4375404Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:17.4376090Z +---------------- 3 ---------------- 2025-05-07T20:32:17.4376373Z | Traceback (most recent call last): 2025-05-07T20:32:17.4377068Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:17.4377835Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4379969Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.4381959Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.4382395Z | self=, 2025-05-07T20:32:17.4382802Z | T=128, 2025-05-07T20:32:17.4382994Z | D=5120, 2025-05-07T20:32:17.4383199Z | scale_ub=1200.0, 2025-05-07T20:32:17.4383432Z | contiguous=True, 2025-05-07T20:32:17.4383668Z | compiled=True, 2025-05-07T20:32:17.4383888Z | ) 2025-05-07T20:32:17.4384052Z | 2025-05-07T20:32:17.4384573Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:17.4385176Z +---------------- 4 ---------------- 2025-05-07T20:32:17.4385457Z | Traceback (most recent call last): 2025-05-07T20:32:17.4386309Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:17.4387365Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4388297Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:17.4389276Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4390622Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:17.4391809Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.4392656Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:17.4393656Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4394678Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:17.4395752Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4396858Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:17.4397956Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4399044Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:17.4400005Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.4400893Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:17.4401810Z | fn() 2025-05-07T20:32:17.4402698Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:17.4403570Z | self.fn.run( 2025-05-07T20:32:17.4404689Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:17.4405514Z | kernel = self.compile( 2025-05-07T20:32:17.4406567Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:17.4428640Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4429673Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:17.4430903Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4431626Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4432112Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4432472Z | ^ 2025-05-07T20:32:17.4433135Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4433925Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.4434490Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:17.4435196Z | self=, 2025-05-07T20:32:17.4435784Z | T=1, # or any other generated value 2025-05-07T20:32:17.4436212Z | D=5120, # or any other generated value 2025-05-07T20:32:17.4436655Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:17.4437145Z | contiguous=True, # or any other generated value 2025-05-07T20:32:17.4437629Z | compiled=True, # or any other generated value 2025-05-07T20:32:17.4438019Z | ) 2025-05-07T20:32:17.4438260Z | 2025-05-07T20:32:17.4438970Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:17.4439803Z +------------------------------------ 2025-05-07T20:32:17.4440295Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:17.4440817Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4441395Z self=, 2025-05-07T20:32:17.4441939Z T=1, 2025-05-07T20:32:17.4442196Z D=5120, 2025-05-07T20:32:17.4442464Z scale_ub=None, 2025-05-07T20:32:17.4442761Z contiguous=True, 2025-05-07T20:32:17.4443074Z compiled=True, 2025-05-07T20:32:17.4443362Z ) 2025-05-07T20:32:17.4443798Z self = 2025-05-07T20:32:17.4444468Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4444840Z 2025-05-07T20:32:17.4444935Z @given( 2025-05-07T20:32:17.4445241Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4445661Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4446091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4446563Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4447003Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4447399Z ) 2025-05-07T20:32:17.4447895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4448502Z def test_silu_mul_quant( 2025-05-07T20:32:17.4448828Z self, 2025-05-07T20:32:17.4449087Z T: int, 2025-05-07T20:32:17.4449351Z D: int, 2025-05-07T20:32:17.4449861Z scale_ub: Optional[float], 2025-05-07T20:32:17.4450247Z contiguous: bool, 2025-05-07T20:32:17.4450567Z compiled: bool, 2025-05-07T20:32:17.4450869Z ) -> None: 2025-05-07T20:32:17.4451158Z torch.manual_seed(2025) 2025-05-07T20:32:17.4451485Z 2025-05-07T20:32:17.4451982Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4452454Z 2025-05-07T20:32:17.4452728Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4453097Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4453516Z x = x_sign * x_clamp 2025-05-07T20:32:17.4453914Z x0 = x[:, :D] 2025-05-07T20:32:17.4454192Z x1 = x[:, D:] 2025-05-07T20:32:17.4454481Z 2025-05-07T20:32:17.4454737Z if contiguous: 2025-05-07T20:32:17.4455048Z x0 = x0.contiguous() 
2025-05-07T20:32:17.4455397Z x1 = x1.contiguous() 2025-05-07T20:32:17.4455719Z 2025-05-07T20:32:17.4455966Z if scale_ub is not None: 2025-05-07T20:32:17.4456332Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4456780Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4457189Z ) 2025-05-07T20:32:17.4457438Z else: 2025-05-07T20:32:17.4457721Z scale_ub_tensor = None 2025-05-07T20:32:17.4458061Z 2025-05-07T20:32:17.4458375Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4458810Z op = silu_mul_quant 2025-05-07T20:32:17.4459162Z if compiled: 2025-05-07T20:32:17.4459500Z op = torch.compile(op) 2025-05-07T20:32:17.4459924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4460306Z 2025-05-07T20:32:17.4460568Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.4460956Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.4461362Z 2025-05-07T20:32:17.4461692Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4462160Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.4462565Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.4462993Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.4463498Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4463926Z 2025-05-07T20:32:17.4464198Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4464469Z 2025-05-07T20:32:17.4464604Z moe/activation_test.py:126: 2025-05-07T20:32:17.4465017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4465483Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.4465941Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4467053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.4468112Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.4468874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4469791Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4470803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.4471742Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4472731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.4473724Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4474695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.4475549Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.4476417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.4477130Z fn() 2025-05-07T20:32:17.4477938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.4478754Z self.fn.run( 2025-05-07T20:32:17.4479391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4480129Z kernel = self.compile( 2025-05-07T20:32:17.4480877Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4481788Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4482297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4482602Z 2025-05-07T20:32:17.4482867Z self = 2025-05-07T20:32:17.4484294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4486146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2ece7040>} 2025-05-07T20:32:17.4487915Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4489279Z context = 2025-05-07T20:32:17.4489669Z 2025-05-07T20:32:17.4489886Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4490572Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4491203Z module_map=module_map) 2025-05-07T20:32:17.4491694Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4492160Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4492525Z E ^ 2025-05-07T20:32:17.4493150Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4493773Z 2025-05-07T20:32:17.4494339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4495017Z 2025-05-07T20:32:17.4495162Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4495722Z self=, 2025-05-07T20:32:17.4496248Z T=2048, 2025-05-07T20:32:17.4496493Z D=5120, 2025-05-07T20:32:17.4496740Z scale_ub=1200.0, 2025-05-07T20:32:17.4497026Z contiguous=True, 2025-05-07T20:32:17.4497326Z compiled=False, 2025-05-07T20:32:17.4497604Z ) 2025-05-07T20:32:17.4498031Z self = 2025-05-07T20:32:17.4498720Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.4499098Z 2025-05-07T20:32:17.4499210Z @given( 2025-05-07T20:32:17.4499512Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4499932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4500342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4500782Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4501212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4501590Z ) 2025-05-07T20:32:17.4502056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4502643Z def test_silu_mul_quant( 2025-05-07T20:32:17.4502965Z self, 2025-05-07T20:32:17.4503279Z T: int, 2025-05-07T20:32:17.4503527Z D: int, 2025-05-07T20:32:17.4504107Z scale_ub: Optional[float], 2025-05-07T20:32:17.4504479Z contiguous: bool, 2025-05-07T20:32:17.4504802Z compiled: bool, 2025-05-07T20:32:17.4505268Z ) -> None: 2025-05-07T20:32:17.4505555Z torch.manual_seed(2025) 2025-05-07T20:32:17.4505865Z 2025-05-07T20:32:17.4506236Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4506710Z 2025-05-07T20:32:17.4506962Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4507443Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4507860Z x = x_sign * x_clamp 2025-05-07T20:32:17.4508182Z x0 = x[:, :D] 
2025-05-07T20:32:17.4508465Z x1 = x[:, D:] 2025-05-07T20:32:17.4508745Z 2025-05-07T20:32:17.4508991Z if contiguous: 2025-05-07T20:32:17.4509300Z x0 = x0.contiguous() 2025-05-07T20:32:17.4509652Z x1 = x1.contiguous() 2025-05-07T20:32:17.4510080Z 2025-05-07T20:32:17.4510331Z if scale_ub is not None: 2025-05-07T20:32:17.4510695Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4511155Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4511560Z ) 2025-05-07T20:32:17.4511818Z else: 2025-05-07T20:32:17.4512100Z scale_ub_tensor = None 2025-05-07T20:32:17.4512445Z 2025-05-07T20:32:17.4512755Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4513180Z op = silu_mul_quant 2025-05-07T20:32:17.4513518Z if compiled: 2025-05-07T20:32:17.4513846Z op = torch.compile(op) 2025-05-07T20:32:17.4514250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4514619Z 2025-05-07T20:32:17.4514876Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4515107Z 2025-05-07T20:32:17.4515233Z moe/activation_test.py:117: 2025-05-07T20:32:17.4515642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4516089Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4516481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4517431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4518364Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4519093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4519977Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4520825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4521516Z kernel = self.compile( 2025-05-07T20:32:17.4522271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4523171Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4523705Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4524010Z 2025-05-07T20:32:17.4524287Z self = 2025-05-07T20:32:17.4525759Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4527624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2f1a53a0>} 2025-05-07T20:32:17.4529470Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4530977Z context = 2025-05-07T20:32:17.4531269Z 2025-05-07T20:32:17.4531444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4532074Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4532547Z module_map=module_map) 2025-05-07T20:32:17.4532914Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4533262Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4533559Z E ^ 2025-05-07T20:32:17.4534024Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4534476Z 2025-05-07T20:32:17.4534901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4535416Z 2025-05-07T20:32:17.4535517Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4535938Z self=, 2025-05-07T20:32:17.4536348Z T=2048, 2025-05-07T20:32:17.4536529Z D=5120, 2025-05-07T20:32:17.4536729Z scale_ub=1200.0, 2025-05-07T20:32:17.4536953Z contiguous=True, 2025-05-07T20:32:17.4537164Z compiled=True, 2025-05-07T20:32:17.4537364Z ) 2025-05-07T20:32:17.4537686Z self = 2025-05-07T20:32:17.4538179Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.4538449Z 2025-05-07T20:32:17.4538527Z @given( 2025-05-07T20:32:17.4538754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4539063Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4539362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4539690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4540020Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4540298Z ) 2025-05-07T20:32:17.4540644Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4541086Z def test_silu_mul_quant( 2025-05-07T20:32:17.4541324Z self, 2025-05-07T20:32:17.4541516Z T: int, 2025-05-07T20:32:17.4541714Z D: int, 2025-05-07T20:32:17.4541935Z scale_ub: Optional[float], 2025-05-07T20:32:17.4542201Z contiguous: bool, 2025-05-07T20:32:17.4542437Z compiled: bool, 2025-05-07T20:32:17.4542663Z ) -> None: 2025-05-07T20:32:17.4542870Z torch.manual_seed(2025) 2025-05-07T20:32:17.4543112Z 2025-05-07T20:32:17.4543384Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4543722Z 2025-05-07T20:32:17.4543907Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4544196Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4544502Z x = x_sign * x_clamp 2025-05-07T20:32:17.4544740Z x0 = x[:, :D] 2025-05-07T20:32:17.4544954Z x1 = x[:, D:] 2025-05-07T20:32:17.4545154Z 2025-05-07T20:32:17.4545335Z if contiguous: 2025-05-07T20:32:17.4545569Z x0 = x0.contiguous() 2025-05-07T20:32:17.4545822Z x1 = x1.contiguous() 2025-05-07T20:32:17.4546064Z 2025-05-07T20:32:17.4546258Z if scale_ub is not None: 2025-05-07T20:32:17.4546524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4546861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4547174Z ) 2025-05-07T20:32:17.4547366Z else: 2025-05-07T20:32:17.4547573Z scale_ub_tensor = None 2025-05-07T20:32:17.4547825Z 2025-05-07T20:32:17.4548054Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4548363Z op = silu_mul_quant 2025-05-07T20:32:17.4548611Z if compiled: 2025-05-07T20:32:17.4548911Z op = torch.compile(op) 2025-05-07T20:32:17.4549201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4549479Z 2025-05-07T20:32:17.4549671Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.4550110Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.4550403Z 2025-05-07T20:32:17.4550634Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4550962Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.4551259Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.4551576Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.4551975Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4552278Z 2025-05-07T20:32:17.4552481Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4552674Z 2025-05-07T20:32:17.4552777Z moe/activation_test.py:126: 2025-05-07T20:32:17.4553067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4553405Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.4553735Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4554532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.4555301Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.4555847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4556531Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4557218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.4557939Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4558696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.4559446Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4560173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.4560813Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.4561416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.4561931Z fn() 2025-05-07T20:32:17.4562432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.4563012Z self.fn.run( 2025-05-07T20:32:17.4563482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4564005Z kernel = self.compile( 2025-05-07T20:32:17.4564550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4565206Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4565608Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4565839Z 2025-05-07T20:32:17.4566045Z self = 2025-05-07T20:32:17.4567142Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4568536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7feb2d89f670>} 2025-05-07T20:32:17.4569897Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4570978Z context = 2025-05-07T20:32:17.4571277Z 2025-05-07T20:32:17.4571519Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4572054Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4572527Z module_map=module_map) 2025-05-07T20:32:17.4572891Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4573296Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4573568Z E ^ 2025-05-07T20:32:17.4574035Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4574496Z 2025-05-07T20:32:17.4574912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4575431Z 
2025-05-07T20:32:17.4575533Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4575949Z self=, 2025-05-07T20:32:17.4576354Z T=16384, 2025-05-07T20:32:17.4576543Z D=7168, 2025-05-07T20:32:17.4576733Z scale_ub=1200.0, 2025-05-07T20:32:17.4576948Z contiguous=False, 2025-05-07T20:32:17.4577176Z compiled=False, 2025-05-07T20:32:17.4577377Z ) 2025-05-07T20:32:17.4577688Z self = 2025-05-07T20:32:17.4578188Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.4590097Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4590362Z moe/activation_test.py:117: 2025-05-07T20:32:17.4603943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4604386Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4604642Z E ^ 2025-05-07T20:32:17.4605113Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4614226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.4614891Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4615306Z self=, 2025-05-07T20:32:17.4615722Z T=1, 2025-05-07T20:32:17.4615911Z D=7168, 2025-05-07T20:32:17.4616097Z scale_ub=None, 2025-05-07T20:32:17.4616316Z contiguous=True, 2025-05-07T20:32:17.4616542Z compiled=True, 2025-05-07T20:32:17.4616740Z ) 2025-05-07T20:32:17.4617055Z self = 2025-05-07T20:32:17.4617660Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4632027Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4632328Z moe/activation_test.py:126: 2025-05-07T20:32:17.4652307Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4652660Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4652923Z E ^ 2025-05-07T20:32:17.4653397Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4654257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
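Every example above fails with the same underlying error: this Triton build only compiles the fp8e4nv type (PyTorch's torch.float8_e4m3fn) on GPUs with compute capability 8.9 or newer, while older parts such as the A10G report (8, 6) and only get the fp8e4b15/fp8e5 encodings. A minimal guard sketch along those lines; the helper name is illustrative and not part of the test file:

    # Hypothetical capability guard -- not part of moe/activation_test.py.
    import torch

    def has_fp8e4nv_support() -> bool:
        """True if the GPU can compile Triton kernels using fp8e4nv.

        Triton's fp8e4nv maps to torch.float8_e4m3fn and, in the Triton
        release shown in this log, requires compute capability >= (8, 9)
        (Ada/Hopper); an A10G reports (8, 6), hence the ValueError above.
        """
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

A test like test_silu_mul_quant could then be decorated with unittest.skipUnless(has_fp8e4nv_support(), "needs SM 8.9+") so that pre-SM89 runners skip instead of failing on every Hypothesis example.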
2025-05-07T20:32:17.4654870Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4655282Z self=, 2025-05-07T20:32:17.4655686Z T=4096, 2025-05-07T20:32:17.4655862Z D=5120, 2025-05-07T20:32:17.4656048Z scale_ub=None, 2025-05-07T20:32:17.4656258Z contiguous=False, 2025-05-07T20:32:17.4656476Z compiled=False, 2025-05-07T20:32:17.4656674Z ) 2025-05-07T20:32:17.4657037Z self = 2025-05-07T20:32:17.4657521Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.4668947Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4669209Z moe/activation_test.py:117: 2025-05-07T20:32:17.4682722Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4683068Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4683321Z E ^ 2025-05-07T20:32:17.4683785Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4684660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.4685269Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4685675Z self=, 2025-05-07T20:32:17.4686078Z T=4096, 2025-05-07T20:32:17.4686250Z D=7168, 2025-05-07T20:32:17.4686439Z scale_ub=None, 2025-05-07T20:32:17.4686648Z contiguous=False, 2025-05-07T20:32:17.4686862Z compiled=False, 2025-05-07T20:32:17.4687063Z ) 2025-05-07T20:32:17.4687380Z self = 2025-05-07T20:32:17.4687866Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.4699296Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4699559Z moe/activation_test.py:117: 2025-05-07T20:32:17.4713263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4713615Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4713868Z E ^ 2025-05-07T20:32:17.4714329Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4715293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
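The ref_fn failures above die inside triton_quantize_fp8_row while Triton's autotuner benchmarks candidate configs. For reference, a pure-PyTorch sketch of the row-wise quantization the test expects, consistent with its dequantization step y = y_fp8.to(torch.float32) * y_scale[:, None]; the exact scale_ub clamping semantics in fbgemm are an assumption here:

    # Pure-PyTorch sketch of row-wise fp8 quantization; scale_ub is
    # treated as a cap on the per-row max magnitude (an assumption).
    import torch

    def quantize_fp8_row_ref(y, scale_ub=None, eps=1e-12):
        row_max = y.abs().amax(dim=1).float()  # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        y_scale = torch.clamp(row_max, min=eps) / fp8_max  # dequant scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Unlike the Triton kernel, this eager-mode cast runs even on pre-SM89 GPUs, since PyTorch emulates float8 conversion rather than emitting fp8e4nv instructions.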
2025-05-07T20:32:17.4715907Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4716424Z self=, 2025-05-07T20:32:17.4716829Z T=128, 2025-05-07T20:32:17.4717012Z D=7168, 2025-05-07T20:32:17.4717201Z scale_ub=None, 2025-05-07T20:32:17.4717410Z contiguous=False, 2025-05-07T20:32:17.4717631Z compiled=True, 2025-05-07T20:32:17.4717890Z ) 2025-05-07T20:32:17.4718196Z self = 2025-05-07T20:32:17.4718681Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.4732523Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4732817Z moe/activation_test.py:126: 2025-05-07T20:32:17.4752742Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4753089Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4753339Z E ^ 2025-05-07T20:32:17.4753799Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4754715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.4755393Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4755802Z self=, 2025-05-07T20:32:17.4756197Z T=128, 2025-05-07T20:32:17.4756375Z D=7168, 2025-05-07T20:32:17.4756551Z scale_ub=None, 2025-05-07T20:32:17.4756825Z contiguous=False, 2025-05-07T20:32:17.4757044Z compiled=False, 2025-05-07T20:32:17.4757237Z ) 2025-05-07T20:32:17.4757544Z self = 2025-05-07T20:32:17.4758023Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.4780548Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4780813Z moe/activation_test.py:117: 2025-05-07T20:32:17.4794581Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4794930Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4795180Z E ^ 2025-05-07T20:32:17.4795661Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4796546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
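Because Hypothesis tries every sampled combination, the log repeats the identical failure for many (T, D, scale_ub, contiguous, compiled) tuples. A standalone repro sketch outside Hypothesis, using only the import path and call signature visible in the traceback and test listing above:

    # Minimal standalone repro sketch; assumes the module is importable
    # from the path shown in the traceback.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # On a pre-SM89 GPU this raises the same CompilationError as the log:
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)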
2025-05-07T20:32:17.4797170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4797576Z self=, 2025-05-07T20:32:17.4797977Z T=4096, 2025-05-07T20:32:17.4798161Z D=5120, 2025-05-07T20:32:17.4798346Z scale_ub=1200.0, 2025-05-07T20:32:17.4798560Z contiguous=True, 2025-05-07T20:32:17.4798782Z compiled=False, 2025-05-07T20:32:17.4798980Z ) 2025-05-07T20:32:17.4799299Z self = 2025-05-07T20:32:17.4799846Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.4811710Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4811972Z moe/activation_test.py:117: 2025-05-07T20:32:17.4825586Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4825936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4826226Z E ^ 2025-05-07T20:32:17.4826689Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4827565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.4828184Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4828590Z self=, 2025-05-07T20:32:17.4828981Z T=1, 2025-05-07T20:32:17.4829153Z D=5120, 2025-05-07T20:32:17.4829335Z scale_ub=None, 2025-05-07T20:32:17.4829543Z contiguous=True, 2025-05-07T20:32:17.4829760Z compiled=True, 2025-05-07T20:32:17.4830015Z ) 2025-05-07T20:32:17.4830326Z self = 2025-05-07T20:32:17.4830811Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4838474Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4838580Z moe/activation_test.py:126: 2025-05-07T20:32:17.4847251Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4847391Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4847472Z E ^ 2025-05-07T20:32:17.4847824Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4848242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
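The dtype names in the ValueError are Triton's, not PyTorch's. Roughly: fp8e4nv corresponds to torch.float8_e4m3fn (NVIDIA's e4m3 without inf), fp8e5 to torch.float8_e5m2, and fp8e4b15 (e4m3 with exponent bias 15) has no direct torch dtype. A quick sketch inspecting the two formats PyTorch exposes, assuming a float8-capable PyTorch build:

    # Inspect the fp8 formats PyTorch exposes; assumes a PyTorch build
    # with float8 dtypes (torch >= 2.1).
    import torch

    for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        info = torch.finfo(dtype)
        print(f"{dtype}: max={info.max}, smallest_normal={info.tiny}")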
2025-05-07T20:32:17.4848349Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4848566Z self=, 2025-05-07T20:32:17.4848643Z T=2048, 2025-05-07T20:32:17.4848720Z D=5120, 2025-05-07T20:32:17.4848797Z scale_ub=None, 2025-05-07T20:32:17.4848882Z contiguous=True, 2025-05-07T20:32:17.4848957Z compiled=True, 2025-05-07T20:32:17.4849025Z ) 2025-05-07T20:32:17.4849239Z self = 2025-05-07T20:32:17.4849408Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4854616Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4854719Z moe/activation_test.py:126: 2025-05-07T20:32:17.4863429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4863526Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4863603Z E ^ 2025-05-07T20:32:17.4863959Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4864374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.4864480Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4864702Z self=, 2025-05-07T20:32:17.4864779Z T=128, 2025-05-07T20:32:17.4864855Z D=5120, 2025-05-07T20:32:17.4864931Z scale_ub=None, 2025-05-07T20:32:17.4865009Z contiguous=True, 2025-05-07T20:32:17.4865086Z compiled=True, 2025-05-07T20:32:17.4865153Z ) 2025-05-07T20:32:17.4865371Z self = 2025-05-07T20:32:17.4865536Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4870828Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4870930Z moe/activation_test.py:126: 2025-05-07T20:32:17.4876196Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4876526Z self = 2025-05-07T20:32:17.4877309Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0,
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4877811Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2c87c940>} 2025-05-07T20:32:17.4878674Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4878865Z context = 2025-05-07T20:32:17.4878869Z 2025-05-07T20:32:17.4879029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4879292Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4879437Z module_map=module_map) 2025-05-07T20:32:17.4879597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4879700Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4879774Z E ^ 2025-05-07T20:32:17.4880125Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4880135Z 2025-05-07T20:32:17.4880543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4880547Z 2025-05-07T20:32:17.4880652Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4880875Z self=, 2025-05-07T20:32:17.4880949Z T=4096, 2025-05-07T20:32:17.4881023Z D=5120, 2025-05-07T20:32:17.4881106Z scale_ub=None, 2025-05-07T20:32:17.4881186Z contiguous=True, 2025-05-07T20:32:17.4881266Z compiled=True, 2025-05-07T20:32:17.4881336Z ) 2025-05-07T20:32:17.4881548Z self = 2025-05-07T20:32:17.4881713Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4881717Z 2025-05-07T20:32:17.4881786Z @given( 2025-05-07T20:32:17.4881903Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4882003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4882111Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4882225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4882346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4882417Z ) 2025-05-07T20:32:17.4882655Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4882750Z def test_silu_mul_quant( 2025-05-07T20:32:17.4882823Z self, 2025-05-07T20:32:17.4882899Z T: int, 2025-05-07T20:32:17.4882971Z D: int, 2025-05-07T20:32:17.4883065Z scale_ub: Optional[float], 2025-05-07T20:32:17.4883154Z contiguous: bool, 2025-05-07T20:32:17.4883233Z compiled: bool, 2025-05-07T20:32:17.4883306Z ) -> None: 2025-05-07T20:32:17.4883397Z torch.manual_seed(2025) 2025-05-07T20:32:17.4883462Z 2025-05-07T20:32:17.4883625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4883696Z 2025-05-07T20:32:17.4883784Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4883901Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4883997Z x = x_sign * x_clamp 2025-05-07T20:32:17.4884073Z x0 = x[:, :D] 2025-05-07T20:32:17.4884153Z x1 = x[:, D:] 2025-05-07T20:32:17.4884219Z 2025-05-07T20:32:17.4884298Z if contiguous: 2025-05-07T20:32:17.4884386Z x0 = x0.contiguous() 2025-05-07T20:32:17.4884473Z x1 = x1.contiguous() 2025-05-07T20:32:17.4884544Z 2025-05-07T20:32:17.4884636Z if scale_ub is not None: 2025-05-07T20:32:17.4884734Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4884863Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4884944Z ) 2025-05-07T20:32:17.4885014Z else: 2025-05-07T20:32:17.4885102Z scale_ub_tensor 
= None 2025-05-07T20:32:17.4885233Z 2025-05-07T20:32:17.4885359Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4885444Z op = silu_mul_quant 2025-05-07T20:32:17.4885531Z if compiled: 2025-05-07T20:32:17.4885731Z op = torch.compile(op) 2025-05-07T20:32:17.4885835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4885903Z 2025-05-07T20:32:17.4885987Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.4886111Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.4886180Z 2025-05-07T20:32:17.4886349Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4886449Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.4886543Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.4886660Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.4886802Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4886875Z 2025-05-07T20:32:17.4886976Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4886981Z 2025-05-07T20:32:17.4887074Z moe/activation_test.py:126: 2025-05-07T20:32:17.4887202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4887304Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.4887433Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4887996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.4888098Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.4888455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4888679Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4889041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.4889296Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4889696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.4889947Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4890318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.4890485Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.4890825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.4890900Z fn() 2025-05-07T20:32:17.4891294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.4891374Z self.fn.run( 2025-05-07T20:32:17.4891709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4891796Z kernel = self.compile( 2025-05-07T20:32:17.4892180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4892352Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4892472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4892480Z 2025-05-07T20:32:17.4892685Z self = 2025-05-07T20:32:17.4893465Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4894022Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2c5e9700>} 2025-05-07T20:32:17.4894869Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4895061Z context = 2025-05-07T20:32:17.4895070Z 2025-05-07T20:32:17.4895230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4895526Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4895632Z module_map=module_map) 2025-05-07T20:32:17.4895787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4895881Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4895962Z E ^ 2025-05-07T20:32:17.4896316Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4896320Z 2025-05-07T20:32:17.4896738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4896742Z 2025-05-07T20:32:17.4896842Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4897059Z self=, 2025-05-07T20:32:17.4897134Z T=16384, 2025-05-07T20:32:17.4897208Z D=5120, 2025-05-07T20:32:17.4897285Z scale_ub=None, 2025-05-07T20:32:17.4897366Z contiguous=True, 2025-05-07T20:32:17.4897445Z compiled=True, 2025-05-07T20:32:17.4897511Z ) 2025-05-07T20:32:17.4897727Z self = 2025-05-07T20:32:17.4897897Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4897905Z 2025-05-07T20:32:17.4897977Z @given( 2025-05-07T20:32:17.4898092Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4898183Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4898300Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4898413Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4898523Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4898597Z ) 2025-05-07T20:32:17.4898837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4898928Z def test_silu_mul_quant( 2025-05-07T20:32:17.4899001Z self, 2025-05-07T20:32:17.4899074Z T: int, 2025-05-07T20:32:17.4899151Z D: int, 2025-05-07T20:32:17.4899243Z scale_ub: Optional[float], 2025-05-07T20:32:17.4899327Z contiguous: bool, 2025-05-07T20:32:17.4899408Z compiled: bool, 2025-05-07T20:32:17.4899482Z ) -> None: 2025-05-07T20:32:17.4899571Z torch.manual_seed(2025) 2025-05-07T20:32:17.4899644Z 2025-05-07T20:32:17.4899805Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4899875Z 2025-05-07T20:32:17.4899964Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4900088Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4900168Z x = x_sign * x_clamp 2025-05-07T20:32:17.4900244Z x0 = x[:, :D] 2025-05-07T20:32:17.4900322Z x1 = x[:, D:] 2025-05-07T20:32:17.4900390Z 2025-05-07T20:32:17.4900472Z if contiguous: 2025-05-07T20:32:17.4900560Z x0 = x0.contiguous() 2025-05-07T20:32:17.4900649Z x1 = x1.contiguous() 2025-05-07T20:32:17.4900716Z 2025-05-07T20:32:17.4900800Z if scale_ub is not None: 2025-05-07T20:32:17.4900904Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4901034Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:17.4901151Z ) 2025-05-07T20:32:17.4901228Z else: 2025-05-07T20:32:17.4901319Z scale_ub_tensor = None 2025-05-07T20:32:17.4901389Z 2025-05-07T20:32:17.4901517Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4905924Z op = silu_mul_quant 2025-05-07T20:32:17.4906034Z if compiled: 2025-05-07T20:32:17.4906136Z op = torch.compile(op) 2025-05-07T20:32:17.4906249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4906319Z 2025-05-07T20:32:17.4906407Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.4906600Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.4906670Z 2025-05-07T20:32:17.4906810Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4906915Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.4907015Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.4907141Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.4907284Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4907354Z 2025-05-07T20:32:17.4907452Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4907457Z 2025-05-07T20:32:17.4907562Z moe/activation_test.py:126: 2025-05-07T20:32:17.4907689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4907801Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.4907933Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4908509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.4908611Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.4908972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4909197Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4909569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.4909976Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4910382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.4910634Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4911014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.4911181Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.4911524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.4911603Z fn() 2025-05-07T20:32:17.4912005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.4912089Z self.fn.run( 2025-05-07T20:32:17.4912426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4912522Z kernel = self.compile( 2025-05-07T20:32:17.4912909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4913081Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4913208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:17.4913213Z 2025-05-07T20:32:17.4913421Z self = 2025-05-07T20:32:17.4914214Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4914799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2c052d30>} 2025-05-07T20:32:17.4915626Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4915824Z context = 2025-05-07T20:32:17.4915868Z 2025-05-07T20:32:17.4916033Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4916298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4916409Z module_map=module_map) 2025-05-07T20:32:17.4916569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4916677Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4916754Z E ^ 2025-05-07T20:32:17.4917112Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4917122Z 2025-05-07T20:32:17.4917540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4917544Z 2025-05-07T20:32:17.4917644Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4917864Z self=, 2025-05-07T20:32:17.4917943Z T=1, 2025-05-07T20:32:17.4918018Z D=5120, 2025-05-07T20:32:17.4918100Z scale_ub=1200.0, 2025-05-07T20:32:17.4918184Z contiguous=True, 2025-05-07T20:32:17.4918265Z compiled=True, 2025-05-07T20:32:17.4918341Z ) 2025-05-07T20:32:17.4918557Z self = 2025-05-07T20:32:17.4918721Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.4918726Z 2025-05-07T20:32:17.4918805Z @given( 2025-05-07T20:32:17.4918922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4919023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4919142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4919256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4919370Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4919440Z ) 2025-05-07T20:32:17.4919686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4919783Z def test_silu_mul_quant( 2025-05-07T20:32:17.4919855Z self, 2025-05-07T20:32:17.4919928Z T: int, 2025-05-07T20:32:17.4920004Z D: int, 2025-05-07T20:32:17.4920101Z scale_ub: Optional[float], 2025-05-07T20:32:17.4920187Z contiguous: bool, 2025-05-07T20:32:17.4920279Z compiled: bool, 2025-05-07T20:32:17.4920356Z ) -> None: 2025-05-07T20:32:17.4920448Z torch.manual_seed(2025) 2025-05-07T20:32:17.4920519Z 2025-05-07T20:32:17.4920689Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4920759Z 2025-05-07T20:32:17.4920853Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4920972Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4921059Z x = x_sign * x_clamp 2025-05-07T20:32:17.4921136Z x0 = x[:, :D] 2025-05-07T20:32:17.4921213Z x1 = x[:, D:] 2025-05-07T20:32:17.4921290Z 2025-05-07T20:32:17.4921369Z if contiguous: 2025-05-07T20:32:17.4921458Z x0 = x0.contiguous() 2025-05-07T20:32:17.4921548Z x1 = x1.contiguous() 2025-05-07T20:32:17.4921616Z 2025-05-07T20:32:17.4921705Z if scale_ub is not None: 2025-05-07T20:32:17.4921812Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:17.4921995Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4922069Z ) 2025-05-07T20:32:17.4922149Z else: 2025-05-07T20:32:17.4922241Z scale_ub_tensor = None 2025-05-07T20:32:17.4922312Z 2025-05-07T20:32:17.4922518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4922609Z op = silu_mul_quant 2025-05-07T20:32:17.4922696Z if compiled: 2025-05-07T20:32:17.4922792Z op = torch.compile(op) 2025-05-07T20:32:17.4922895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4923006Z 2025-05-07T20:32:17.4923093Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4923097Z 2025-05-07T20:32:17.4923192Z moe/activation_test.py:117: 2025-05-07T20:32:17.4923320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4923417Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4923519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4923889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.4923976Z return fn(*args, **kwargs) 2025-05-07T20:32:17.4924480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4924576Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4924930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4925152Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4925490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4925583Z kernel = self.compile( 2025-05-07T20:32:17.4925957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4926132Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4926261Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4926266Z 2025-05-07T20:32:17.4926474Z self = 2025-05-07T20:32:17.4927261Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4927772Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2bb5ec10>} 2025-05-07T20:32:17.4928524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4928716Z context = 2025-05-07T20:32:17.4928721Z 2025-05-07T20:32:17.4928881Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4929148Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4929253Z module_map=module_map) 2025-05-07T20:32:17.4929414Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4929510Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4929587Z E ^ 2025-05-07T20:32:17.4929942Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4929946Z 2025-05-07T20:32:17.4930357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4930407Z 2025-05-07T20:32:17.4930507Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4930729Z self=, 2025-05-07T20:32:17.4930803Z T=1, 2025-05-07T20:32:17.4930881Z D=5120, 2025-05-07T20:32:17.4931057Z scale_ub=None, 2025-05-07T20:32:17.4931142Z contiguous=False, 2025-05-07T20:32:17.4931228Z compiled=True, 2025-05-07T20:32:17.4931297Z ) 2025-05-07T20:32:17.4931511Z self = 2025-05-07T20:32:17.4931677Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.4931719Z 2025-05-07T20:32:17.4931795Z @given( 2025-05-07T20:32:17.4931911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4932010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4932121Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4932241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4932354Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4932426Z ) 2025-05-07T20:32:17.4932673Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4932766Z def test_silu_mul_quant( 2025-05-07T20:32:17.4932844Z self, 2025-05-07T20:32:17.4932920Z T: int, 2025-05-07T20:32:17.4932993Z D: int, 2025-05-07T20:32:17.4933088Z scale_ub: Optional[float], 2025-05-07T20:32:17.4933177Z contiguous: bool, 2025-05-07T20:32:17.4933259Z compiled: bool, 2025-05-07T20:32:17.4933337Z ) -> None: 2025-05-07T20:32:17.4933436Z torch.manual_seed(2025) 2025-05-07T20:32:17.4933507Z 2025-05-07T20:32:17.4933675Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4933745Z 2025-05-07T20:32:17.4933832Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4933956Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4934045Z x = x_sign * x_clamp 2025-05-07T20:32:17.4934121Z x0 = x[:, :D] 2025-05-07T20:32:17.4934202Z x1 = x[:, D:] 2025-05-07T20:32:17.4934270Z 2025-05-07T20:32:17.4934350Z if contiguous: 2025-05-07T20:32:17.4934447Z x0 = x0.contiguous() 2025-05-07T20:32:17.4934537Z x1 = x1.contiguous() 2025-05-07T20:32:17.4934608Z 2025-05-07T20:32:17.4934698Z if scale_ub is not None: 2025-05-07T20:32:17.4934800Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4934935Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4935011Z ) 2025-05-07T20:32:17.4935086Z else: 2025-05-07T20:32:17.4935179Z scale_ub_tensor = None 2025-05-07T20:32:17.4935749Z 2025-05-07T20:32:17.4935877Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4935970Z op = silu_mul_quant 2025-05-07T20:32:17.4936050Z if compiled: 2025-05-07T20:32:17.4936148Z op = torch.compile(op) 2025-05-07T20:32:17.4936255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4936325Z 2025-05-07T20:32:17.4936411Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.4936531Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.4936605Z 2025-05-07T20:32:17.4936742Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4936842Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.4936938Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.4937060Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.4937200Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4937272Z 2025-05-07T20:32:17.4937371Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:17.4937376Z 2025-05-07T20:32:17.4937471Z moe/activation_test.py:126: 2025-05-07T20:32:17.4937598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4937750Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.4937882Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4938514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.4938612Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.4938969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4939197Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4939597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.4939853Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4940245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.4940499Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4940874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.4941042Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.4941381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.4941458Z fn() 2025-05-07T20:32:17.4941854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.4941938Z self.fn.run( 2025-05-07T20:32:17.4942267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4942357Z kernel = self.compile( 2025-05-07T20:32:17.4942740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4942914Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4943046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4943051Z 2025-05-07T20:32:17.4943253Z self = 2025-05-07T20:32:17.4944037Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4944553Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7feb2c5f81f0>} 2025-05-07T20:32:17.4945297Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4945492Z context = 2025-05-07T20:32:17.4945497Z 2025-05-07T20:32:17.4945661Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4945920Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4946028Z module_map=module_map) 2025-05-07T20:32:17.4946186Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4946287Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4946359Z E ^ 2025-05-07T20:32:17.4946711Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4946715Z 2025-05-07T20:32:17.4947130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4947178Z 2025-05-07T20:32:17.4947279Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4947498Z self=, 2025-05-07T20:32:17.4947646Z T=1, 2025-05-07T20:32:17.4947721Z D=5120, 2025-05-07T20:32:17.4947802Z scale_ub=None, 2025-05-07T20:32:17.4947887Z contiguous=True, 2025-05-07T20:32:17.4947968Z compiled=False, 2025-05-07T20:32:17.4948041Z ) 2025-05-07T20:32:17.4948254Z self = 2025-05-07T20:32:17.4948452Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.4948456Z 2025-05-07T20:32:17.4948533Z @given( 2025-05-07T20:32:17.4948647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4948745Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4948857Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4948976Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4949089Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4949159Z ) 2025-05-07T20:32:17.4949406Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4949501Z def test_silu_mul_quant( 2025-05-07T20:32:17.4949577Z self, 2025-05-07T20:32:17.4949650Z T: int, 2025-05-07T20:32:17.4949730Z D: int, 2025-05-07T20:32:17.4949887Z scale_ub: Optional[float], 2025-05-07T20:32:17.4949974Z contiguous: bool, 2025-05-07T20:32:17.4950063Z compiled: bool, 2025-05-07T20:32:17.4950138Z ) -> None: 2025-05-07T20:32:17.4950232Z torch.manual_seed(2025) 2025-05-07T20:32:17.4950300Z 2025-05-07T20:32:17.4950463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4950536Z 2025-05-07T20:32:17.4950623Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4950747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4950835Z x = x_sign * x_clamp 2025-05-07T20:32:17.4950913Z x0 = x[:, :D] 2025-05-07T20:32:17.4950990Z x1 = x[:, D:] 2025-05-07T20:32:17.4951062Z 2025-05-07T20:32:17.4951146Z if contiguous: 2025-05-07T20:32:17.4951234Z x0 = x0.contiguous() 2025-05-07T20:32:17.4951322Z x1 = x1.contiguous() 2025-05-07T20:32:17.4951392Z 2025-05-07T20:32:17.4951480Z if scale_ub is not None: 2025-05-07T20:32:17.4951585Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4951720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4951795Z ) 2025-05-07T20:32:17.4951869Z else: 2025-05-07T20:32:17.4951962Z scale_ub_tensor = None 2025-05-07T20:32:17.4952036Z 2025-05-07T20:32:17.4952164Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4952249Z op = silu_mul_quant 2025-05-07T20:32:17.4952336Z if compiled: 2025-05-07T20:32:17.4952432Z op 
= torch.compile(op) 2025-05-07T20:32:17.4952534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4952606Z 2025-05-07T20:32:17.4952697Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4952702Z 2025-05-07T20:32:17.4952798Z moe/activation_test.py:117: 2025-05-07T20:32:17.4952923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4953021Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4953120Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4953622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4953719Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4954076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4954346Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4954682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4954772Z kernel = self.compile( 2025-05-07T20:32:17.4955225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4955401Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4955522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4955564Z 2025-05-07T20:32:17.4955768Z self = 2025-05-07T20:32:17.4956551Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4957058Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2bb5eb80>} 2025-05-07T20:32:17.4957820Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4958010Z context = 2025-05-07T20:32:17.4958015Z 2025-05-07T20:32:17.4958183Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4958441Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4958545Z module_map=module_map) 2025-05-07T20:32:17.4958710Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4958809Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4958885Z E ^ 2025-05-07T20:32:17.4959242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4959246Z 2025-05-07T20:32:17.4959661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4959666Z 2025-05-07T20:32:17.4959767Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4959986Z self=, 2025-05-07T20:32:17.4960064Z T=128, 2025-05-07T20:32:17.4960139Z D=5120, 2025-05-07T20:32:17.4960218Z scale_ub=None, 2025-05-07T20:32:17.4960299Z contiguous=False, 2025-05-07T20:32:17.4960382Z compiled=True, 2025-05-07T20:32:17.4960451Z ) 2025-05-07T20:32:17.4960669Z self = 2025-05-07T20:32:17.4960839Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.4960846Z 2025-05-07T20:32:17.4960922Z @given( 2025-05-07T20:32:17.4961040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4961135Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4961250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4961368Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4961478Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4961548Z ) 2025-05-07T20:32:17.4961794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4961888Z def test_silu_mul_quant( 2025-05-07T20:32:17.4961964Z self, 2025-05-07T20:32:17.4962038Z T: int, 2025-05-07T20:32:17.4962112Z D: int, 2025-05-07T20:32:17.4962208Z scale_ub: Optional[float], 2025-05-07T20:32:17.4962295Z contiguous: bool, 2025-05-07T20:32:17.4962376Z compiled: bool, 2025-05-07T20:32:17.4962525Z ) -> None: 2025-05-07T20:32:17.4962615Z torch.manual_seed(2025) 2025-05-07T20:32:17.4962684Z 2025-05-07T20:32:17.4962854Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4962929Z 2025-05-07T20:32:17.4963090Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4963216Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4963301Z x = x_sign * x_clamp 2025-05-07T20:32:17.4963383Z x0 = x[:, :D] 2025-05-07T20:32:17.4963458Z x1 = x[:, D:] 2025-05-07T20:32:17.4963526Z 2025-05-07T20:32:17.4963650Z if contiguous: 2025-05-07T20:32:17.4963737Z x0 = x0.contiguous() 2025-05-07T20:32:17.4963821Z x1 = x1.contiguous() 2025-05-07T20:32:17.4963893Z 2025-05-07T20:32:17.4963981Z if scale_ub is not None: 2025-05-07T20:32:17.4964081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4964215Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4964291Z ) 2025-05-07T20:32:17.4964365Z else: 2025-05-07T20:32:17.4964460Z scale_ub_tensor = None 2025-05-07T20:32:17.4964527Z 2025-05-07T20:32:17.4964656Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4964748Z op = silu_mul_quant 2025-05-07T20:32:17.4964839Z if compiled: 2025-05-07T20:32:17.4964936Z op = torch.compile(op) 2025-05-07T20:32:17.4965037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4965110Z 2025-05-07T20:32:17.4965203Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4965208Z 2025-05-07T20:32:17.4965301Z moe/activation_test.py:117: 2025-05-07T20:32:17.4965426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4965525Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4965622Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4965993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.4966086Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.4966588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4966683Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4967036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4967263Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4967600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4967693Z kernel = self.compile( 2025-05-07T20:32:17.4968069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4968245Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4968375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4968380Z 2025-05-07T20:32:17.4968584Z self = 2025-05-07T20:32:17.4969371Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4969929Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2c5e9a60>} 2025-05-07T20:32:17.4970684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4970925Z context = 2025-05-07T20:32:17.4970930Z 2025-05-07T20:32:17.4971092Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4971488Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4971594Z module_map=module_map) 2025-05-07T20:32:17.4971751Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4971851Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4971925Z E ^ 2025-05-07T20:32:17.4972318Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4972325Z 2025-05-07T20:32:17.4972734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4972739Z 2025-05-07T20:32:17.4972839Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4973063Z self=, 2025-05-07T20:32:17.4973139Z T=128, 2025-05-07T20:32:17.4973211Z D=7168, 2025-05-07T20:32:17.4973295Z scale_ub=1200.0, 2025-05-07T20:32:17.4973382Z contiguous=False, 2025-05-07T20:32:17.4973464Z compiled=False, 2025-05-07T20:32:17.4973539Z ) 2025-05-07T20:32:17.4973752Z self = 2025-05-07T20:32:17.4973923Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.4973930Z 2025-05-07T20:32:17.4974005Z @given( 2025-05-07T20:32:17.4974120Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4974220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4974331Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4974443Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4974555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4974628Z ) 2025-05-07T20:32:17.4974869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4974962Z def test_silu_mul_quant( 2025-05-07T20:32:17.4975036Z self, 2025-05-07T20:32:17.4975118Z T: int, 2025-05-07T20:32:17.4975191Z D: int, 2025-05-07T20:32:17.4975286Z scale_ub: Optional[float], 2025-05-07T20:32:17.4975374Z contiguous: bool, 2025-05-07T20:32:17.4975456Z compiled: bool, 2025-05-07T20:32:17.4975530Z ) -> None: 2025-05-07T20:32:17.4975623Z torch.manual_seed(2025) 2025-05-07T20:32:17.4975697Z 2025-05-07T20:32:17.4975864Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4975937Z 2025-05-07T20:32:17.4976025Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4976146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4976234Z x = x_sign * x_clamp 2025-05-07T20:32:17.4976313Z x0 = x[:, :D] 2025-05-07T20:32:17.4976393Z x1 = x[:, D:] 2025-05-07T20:32:17.4976461Z 2025-05-07T20:32:17.4976540Z if contiguous: 2025-05-07T20:32:17.4976630Z x0 = x0.contiguous() 2025-05-07T20:32:17.4976719Z x1 = x1.contiguous() 2025-05-07T20:32:17.4976786Z 2025-05-07T20:32:17.4976877Z if scale_ub is not None: 2025-05-07T20:32:17.4976978Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4977108Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4977186Z ) 2025-05-07T20:32:17.4977263Z else: 2025-05-07T20:32:17.4977354Z scale_ub_tensor = None 2025-05-07T20:32:17.4977426Z 2025-05-07T20:32:17.4977551Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4977637Z op = silu_mul_quant 2025-05-07T20:32:17.4977719Z if compiled: 2025-05-07T20:32:17.4977816Z op = torch.compile(op) 2025-05-07T20:32:17.4977971Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4978039Z 2025-05-07T20:32:17.4978125Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4978130Z 2025-05-07T20:32:17.4978228Z moe/activation_test.py:117: 2025-05-07T20:32:17.4978424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4978522Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4978621Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4979117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4979254Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4979606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4979827Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4980165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4980259Z kernel = self.compile( 2025-05-07T20:32:17.4980637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4980816Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4980940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4980944Z 2025-05-07T20:32:17.4981151Z self = 2025-05-07T20:32:17.4981934Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4982439Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2b75d4c0>} 2025-05-07T20:32:17.4983198Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4983390Z context = 2025-05-07T20:32:17.4983394Z 2025-05-07T20:32:17.4983558Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4983815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4983923Z module_map=module_map) 2025-05-07T20:32:17.4984090Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4984185Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4984266Z E ^ 2025-05-07T20:32:17.4984618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4984625Z 2025-05-07T20:32:17.4985034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4985039Z 2025-05-07T20:32:17.4985149Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4985374Z self=, 2025-05-07T20:32:17.4985450Z T=128, 2025-05-07T20:32:17.4985522Z D=5120, 2025-05-07T20:32:17.4985600Z scale_ub=None, 2025-05-07T20:32:17.4985687Z contiguous=False, 2025-05-07T20:32:17.4985767Z compiled=False, 2025-05-07T20:32:17.4985836Z ) 2025-05-07T20:32:17.4986052Z self = 2025-05-07T20:32:17.4986217Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.4986221Z 2025-05-07T20:32:17.4986296Z @given( 2025-05-07T20:32:17.4986459Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4986559Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4986672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4986785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4986973Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4987048Z ) 2025-05-07T20:32:17.4987288Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4987377Z def test_silu_mul_quant( 2025-05-07T20:32:17.4987455Z self, 2025-05-07T20:32:17.4987590Z T: int, 2025-05-07T20:32:17.4987662Z D: int, 2025-05-07T20:32:17.4987758Z scale_ub: Optional[float], 2025-05-07T20:32:17.4987844Z contiguous: bool, 2025-05-07T20:32:17.4987926Z compiled: bool, 2025-05-07T20:32:17.4988002Z ) -> None: 2025-05-07T20:32:17.4988092Z torch.manual_seed(2025) 2025-05-07T20:32:17.4988166Z 2025-05-07T20:32:17.4988330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4988403Z 2025-05-07T20:32:17.4988496Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4988615Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4988704Z x = x_sign * x_clamp 2025-05-07T20:32:17.4988782Z x0 = x[:, :D] 2025-05-07T20:32:17.4988859Z x1 = x[:, D:] 2025-05-07T20:32:17.4988929Z 2025-05-07T20:32:17.4989011Z if contiguous: 2025-05-07T20:32:17.4989098Z x0 = x0.contiguous() 2025-05-07T20:32:17.4989185Z x1 = x1.contiguous() 2025-05-07T20:32:17.4989260Z 2025-05-07T20:32:17.4989347Z if scale_ub is not None: 2025-05-07T20:32:17.4989448Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4989584Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4989657Z ) 2025-05-07T20:32:17.4989734Z else: 2025-05-07T20:32:17.4989877Z scale_ub_tensor = None 2025-05-07T20:32:17.4989946Z 2025-05-07T20:32:17.4990079Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4990165Z op = silu_mul_quant 2025-05-07T20:32:17.4990248Z if compiled: 2025-05-07T20:32:17.4990351Z op = torch.compile(op) 2025-05-07T20:32:17.4990453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4990521Z 2025-05-07T20:32:17.4990611Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4990616Z 2025-05-07T20:32:17.4990710Z moe/activation_test.py:117: 2025-05-07T20:32:17.4990839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4990937Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4991030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4991534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4991631Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4991987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4992209Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4992548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4992640Z kernel = self.compile( 2025-05-07T20:32:17.4993017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4993193Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4993319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4993323Z 2025-05-07T20:32:17.4993526Z self = 2025-05-07T20:32:17.4994306Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4994928Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2bc2aee0>} 2025-05-07T20:32:17.4995673Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4995907Z context = 2025-05-07T20:32:17.4995912Z 2025-05-07T20:32:17.4996073Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4996332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4996439Z module_map=module_map) 2025-05-07T20:32:17.4996598Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4996694Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4996767Z E ^ 2025-05-07T20:32:17.4997125Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4997543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.4997650Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) [test body and traceback identical to the example above; same CompilationError -- elided]
2025-05-07T20:32:17.5010276Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) [identical failure; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching activation.py:80 -- elided]
2025-05-07T20:32:17.5023291Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) [identical failure -- elided]
2025-05-07T20:32:17.5046478Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) [same test body; this example gets further: the silu_mul_quant launch succeeds and the failure moves to the reference path]
2025-05-07T20:32:17.5072792Z         y_fp8, y_scale = fn()
2025-05-07T20:32:17.5072917Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:17.5073218Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:17.5073322Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:17.5073425Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:17.5073550Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:17.5073740Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:17.5073914Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:17.5074023Z moe/activation_test.py:126:
2025-05-07T20:32:17.5074275Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:17.5074421Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:17.5074994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:17.5075092Z     _kernel_quantize_fp8_row[grid](
[autotuner frames elided: triton/runtime/autotuner.py:186 run -> autotuner.py:166 _bench -> triton/testing.py:117 do_bench -> autotuner.py:152 kernel_call -> jit.py:623 run -> compiler.py:273 compile -> make_ir]
2025-05-07T20:32:17.5082856Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.5082994Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:17.5083068Z E       ^
2025-05-07T20:32:17.5083462Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.5083952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
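The failure mode above is uniform: Triton refuses to compile any kernel that touches the fp8e4nv element type (torch.float8_e4m3fn) on this GPU, whose Triton backend only exposes 'fp8e4b15' and 'fp8e5'. fp8e4nv lowering generally requires NVIDIA compute capability 8.9 or newer, so a hardware guard would turn this wall of CompilationErrors into clean skips. A minimal sketch, assuming the SM 8.9+ requirement holds and using only standard torch/unittest APIs; supports_fp8e4nv and the class name are illustrative, not FBGEMM code:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (torch.float8_e4m3fn) kernels compile only on
    # NVIDIA GPUs with compute capability >= (8, 9), i.e. Ada/Hopper.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "Triton build lacks fp8e4nv on this GPU")
class SiluMulQuantFP8Test(unittest.TestCase):
    # Hypothetical home for the test_silu_mul_quant case shown above.
    pass

With a guard like this, the test would never reach the kernel launch on unsupported hardware, and the Hypothesis examples above would be reported as skips instead of failing compilation one by one.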
2025-05-07T20:32:17.5084062Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) [identical CompilationError in _fbgemm_silu_mul_quant -- elided]
2025-05-07T20:32:17.5097063Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) [identical failure -- elided]
2025-05-07T20:32:17.5110725Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) [identical failure -- elided]
2025-05-07T20:32:17.5123927Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) [identical failure -- elided]
2025-05-07T20:32:17.5137205Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) [identical failure -- elided]
2025-05-07T20:32:17.5150052Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) [identical failure -- elided]
2025-05-07T20:32:17.5162684Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) [identical failure; traceback elided up to the final error]
2025-05-07T20:32:17.5175134Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.5175234Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.5175316Z E       ^
2025-05-07T20:32:17.5175679Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5175688Z 2025-05-07T20:32:17.5176099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5176104Z 2025-05-07T20:32:17.5176214Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5176440Z self=, 2025-05-07T20:32:17.5176515Z T=4096, 2025-05-07T20:32:17.5176597Z D=5120, 2025-05-07T20:32:17.5176678Z scale_ub=None, 2025-05-07T20:32:17.5176770Z contiguous=False, 2025-05-07T20:32:17.5176852Z compiled=True, 2025-05-07T20:32:17.5176928Z ) 2025-05-07T20:32:17.5177152Z self = 2025-05-07T20:32:17.5177323Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.5177327Z 2025-05-07T20:32:17.5177404Z @given( 2025-05-07T20:32:17.5177527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5182515Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5182658Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5182777Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5182893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5182972Z ) 2025-05-07T20:32:17.5183220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5183318Z def test_silu_mul_quant( 2025-05-07T20:32:17.5183393Z self, 2025-05-07T20:32:17.5183468Z T: int, 2025-05-07T20:32:17.5183545Z D: int, 2025-05-07T20:32:17.5183645Z scale_ub: Optional[float], 2025-05-07T20:32:17.5183737Z contiguous: bool, 2025-05-07T20:32:17.5183820Z compiled: bool, 2025-05-07T20:32:17.5183897Z ) -> None: 2025-05-07T20:32:17.5183992Z torch.manual_seed(2025) 2025-05-07T20:32:17.5184062Z 2025-05-07T20:32:17.5184230Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5184410Z 2025-05-07T20:32:17.5184501Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5184625Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5184716Z x = x_sign * x_clamp 2025-05-07T20:32:17.5184794Z x0 = x[:, :D] 2025-05-07T20:32:17.5184948Z x1 = x[:, D:] 2025-05-07T20:32:17.5185026Z 2025-05-07T20:32:17.5185107Z if contiguous: 2025-05-07T20:32:17.5185196Z x0 = x0.contiguous() 2025-05-07T20:32:17.5185287Z x1 = x1.contiguous() 2025-05-07T20:32:17.5185357Z 2025-05-07T20:32:17.5185448Z if scale_ub is not None: 2025-05-07T20:32:17.5185594Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5185729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5185808Z ) 2025-05-07T20:32:17.5185883Z else: 2025-05-07T20:32:17.5185975Z scale_ub_tensor = None 2025-05-07T20:32:17.5186051Z 2025-05-07T20:32:17.5186182Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5186270Z op = silu_mul_quant 2025-05-07T20:32:17.5186356Z if compiled: 2025-05-07T20:32:17.5186453Z op = torch.compile(op) 2025-05-07T20:32:17.5186565Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5186638Z 2025-05-07T20:32:17.5186726Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5186731Z 2025-05-07T20:32:17.5186832Z moe/activation_test.py:117: 2025-05-07T20:32:17.5186959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5187062Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5187165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5187536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5187625Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5188128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5188226Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5188584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5188812Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5189148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5189244Z kernel = self.compile( 2025-05-07T20:32:17.5189623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5189796Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5190004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5190008Z 2025-05-07T20:32:17.5190215Z self = 2025-05-07T20:32:17.5191009Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5191527Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2aec68b0>} 2025-05-07T20:32:17.5192279Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5192478Z context = 2025-05-07T20:32:17.5192483Z 2025-05-07T20:32:17.5192647Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5192962Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5193068Z module_map=module_map) 2025-05-07T20:32:17.5193235Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5193411Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5193487Z E ^ 2025-05-07T20:32:17.5193850Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5193855Z 2025-05-07T20:32:17.5194268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5194335Z 2025-05-07T20:32:17.5194440Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5194661Z self=, 2025-05-07T20:32:17.5194736Z T=4096, 2025-05-07T20:32:17.5194815Z D=5120, 2025-05-07T20:32:17.5194898Z scale_ub=1200.0, 2025-05-07T20:32:17.5194984Z contiguous=False, 2025-05-07T20:32:17.5195069Z compiled=False, 2025-05-07T20:32:17.5195140Z ) 2025-05-07T20:32:17.5195355Z self = 2025-05-07T20:32:17.5195539Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.5195544Z 2025-05-07T20:32:17.5195618Z @given( 2025-05-07T20:32:17.5195734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5195835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5195948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5196070Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5196186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5196257Z ) 2025-05-07T20:32:17.5196504Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5196594Z def test_silu_mul_quant( 2025-05-07T20:32:17.5196670Z self, 2025-05-07T20:32:17.5196746Z T: int, 2025-05-07T20:32:17.5196820Z D: int, 2025-05-07T20:32:17.5196915Z scale_ub: Optional[float], 2025-05-07T20:32:17.5197006Z contiguous: bool, 2025-05-07T20:32:17.5197094Z compiled: bool, 2025-05-07T20:32:17.5197176Z ) -> None: 2025-05-07T20:32:17.5197267Z torch.manual_seed(2025) 2025-05-07T20:32:17.5197337Z 2025-05-07T20:32:17.5197506Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5197578Z 2025-05-07T20:32:17.5197669Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5197797Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5197885Z x = x_sign * x_clamp 2025-05-07T20:32:17.5197964Z x0 = x[:, :D] 2025-05-07T20:32:17.5198044Z x1 = x[:, D:] 2025-05-07T20:32:17.5198114Z 2025-05-07T20:32:17.5198197Z if contiguous: 2025-05-07T20:32:17.5198290Z x0 = x0.contiguous() 2025-05-07T20:32:17.5198379Z x1 = x1.contiguous() 2025-05-07T20:32:17.5198451Z 2025-05-07T20:32:17.5198543Z if scale_ub is not None: 2025-05-07T20:32:17.5198647Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5198795Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5198870Z ) 2025-05-07T20:32:17.5198945Z else: 2025-05-07T20:32:17.5199041Z scale_ub_tensor = None 2025-05-07T20:32:17.5199111Z 2025-05-07T20:32:17.5199241Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5199337Z op = silu_mul_quant 2025-05-07T20:32:17.5199421Z if compiled: 2025-05-07T20:32:17.5199519Z op = torch.compile(op) 2025-05-07T20:32:17.5199633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5199703Z 2025-05-07T20:32:17.5199796Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5199801Z 2025-05-07T20:32:17.5199894Z moe/activation_test.py:117: 2025-05-07T20:32:17.5200070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5200173Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5200271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5200861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5200960Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5201318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5201580Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5201916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5202009Z kernel = self.compile( 2025-05-07T20:32:17.5202394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5202569Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5202699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5202704Z 2025-05-07T20:32:17.5202914Z self = 2025-05-07T20:32:17.5203699Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5204612Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2afcd040>} 2025-05-07T20:32:17.5205370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5205575Z context = 2025-05-07T20:32:17.5205580Z 2025-05-07T20:32:17.5205750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5206011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5206124Z module_map=module_map) 2025-05-07T20:32:17.5206284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5206387Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5206462Z E ^ 2025-05-07T20:32:17.5206817Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5206822Z 2025-05-07T20:32:17.5207238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5207246Z 2025-05-07T20:32:17.5207346Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5207571Z self=, 2025-05-07T20:32:17.5207646Z T=4096, 2025-05-07T20:32:17.5207725Z D=5120, 2025-05-07T20:32:17.5207811Z scale_ub=1200.0, 2025-05-07T20:32:17.5207895Z contiguous=False, 2025-05-07T20:32:17.5207976Z compiled=True, 2025-05-07T20:32:17.5208052Z ) 2025-05-07T20:32:17.5208267Z self = 2025-05-07T20:32:17.5208439Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.5208444Z 2025-05-07T20:32:17.5208526Z @given( 2025-05-07T20:32:17.5208642Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5208744Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5208855Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5209082Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5209199Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5209271Z ) 2025-05-07T20:32:17.5209515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5209726Z def test_silu_mul_quant( 2025-05-07T20:32:17.5209804Z self, 2025-05-07T20:32:17.5209881Z T: int, 2025-05-07T20:32:17.5209961Z D: int, 2025-05-07T20:32:17.5210057Z scale_ub: Optional[float], 2025-05-07T20:32:17.5210143Z contiguous: bool, 2025-05-07T20:32:17.5210293Z compiled: bool, 2025-05-07T20:32:17.5210370Z ) -> None: 2025-05-07T20:32:17.5210467Z torch.manual_seed(2025) 2025-05-07T20:32:17.5210537Z 2025-05-07T20:32:17.5210703Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5210780Z 2025-05-07T20:32:17.5210872Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5210993Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5211087Z x = x_sign * x_clamp 2025-05-07T20:32:17.5211165Z x0 = x[:, :D] 2025-05-07T20:32:17.5211241Z x1 = x[:, D:] 2025-05-07T20:32:17.5211316Z 2025-05-07T20:32:17.5211397Z if contiguous: 2025-05-07T20:32:17.5211489Z x0 = x0.contiguous() 2025-05-07T20:32:17.5211582Z x1 = x1.contiguous() 2025-05-07T20:32:17.5211655Z 2025-05-07T20:32:17.5211749Z if scale_ub is not None: 2025-05-07T20:32:17.5211850Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5211982Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5212065Z ) 2025-05-07T20:32:17.5212140Z else: 2025-05-07T20:32:17.5212231Z scale_ub_tensor = None 2025-05-07T20:32:17.5212306Z 2025-05-07T20:32:17.5212434Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5212523Z op = silu_mul_quant 2025-05-07T20:32:17.5212612Z if compiled: 2025-05-07T20:32:17.5212710Z op = torch.compile(op) 2025-05-07T20:32:17.5212815Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5212889Z 2025-05-07T20:32:17.5212976Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5212986Z 2025-05-07T20:32:17.5213085Z moe/activation_test.py:117: 2025-05-07T20:32:17.5213212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5213309Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5213410Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5213778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5213868Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5214365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5214459Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5214823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5215045Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5215385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5215486Z kernel = self.compile( 2025-05-07T20:32:17.5215867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5216044Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5216173Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5216178Z 2025-05-07T20:32:17.5216382Z self = 2025-05-07T20:32:17.5217171Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5217803Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2afcdee0>} 2025-05-07T20:32:17.5218565Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5218799Z context = 2025-05-07T20:32:17.5218804Z 2025-05-07T20:32:17.5218965Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5219229Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5219338Z module_map=module_map) 2025-05-07T20:32:17.5219504Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5219601Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5219675Z E ^ 2025-05-07T20:32:17.5220045Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5220050Z 2025-05-07T20:32:17.5220460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5220465Z 2025-05-07T20:32:17.5220567Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5220790Z self=, 2025-05-07T20:32:17.5220865Z T=2048, 2025-05-07T20:32:17.5220943Z D=7168, 2025-05-07T20:32:17.5221024Z scale_ub=1200.0, 2025-05-07T20:32:17.5221106Z contiguous=False, 2025-05-07T20:32:17.5221191Z compiled=False, 2025-05-07T20:32:17.5221263Z ) 2025-05-07T20:32:17.5221477Z self = 2025-05-07T20:32:17.5221659Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.5221663Z 2025-05-07T20:32:17.5221738Z @given( 2025-05-07T20:32:17.5221858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5221961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5222072Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5222193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5222306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5222378Z ) 2025-05-07T20:32:17.5222630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5222721Z def test_silu_mul_quant( 2025-05-07T20:32:17.5222796Z self, 2025-05-07T20:32:17.5222875Z T: int, 2025-05-07T20:32:17.5222949Z D: int, 2025-05-07T20:32:17.5223048Z scale_ub: Optional[float], 2025-05-07T20:32:17.5223137Z contiguous: bool, 2025-05-07T20:32:17.5223221Z compiled: bool, 2025-05-07T20:32:17.5223298Z ) -> None: 2025-05-07T20:32:17.5223395Z torch.manual_seed(2025) 2025-05-07T20:32:17.5223473Z 2025-05-07T20:32:17.5223645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5223716Z 2025-05-07T20:32:17.5223806Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5223932Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5224019Z x = x_sign * x_clamp 2025-05-07T20:32:17.5224098Z x0 = x[:, :D] 2025-05-07T20:32:17.5224182Z x1 = x[:, D:] 2025-05-07T20:32:17.5224251Z 2025-05-07T20:32:17.5224331Z if contiguous: 2025-05-07T20:32:17.5224425Z x0 = x0.contiguous() 2025-05-07T20:32:17.5224512Z x1 = x1.contiguous() 2025-05-07T20:32:17.5224581Z 2025-05-07T20:32:17.5224722Z if scale_ub is not None: 2025-05-07T20:32:17.5224825Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5224961Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5225038Z ) 2025-05-07T20:32:17.5225112Z else: 2025-05-07T20:32:17.5225308Z scale_ub_tensor = None 2025-05-07T20:32:17.5225379Z 2025-05-07T20:32:17.5225509Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5225600Z op = silu_mul_quant 2025-05-07T20:32:17.5225683Z if compiled: 2025-05-07T20:32:17.5225779Z op = torch.compile(op) 2025-05-07T20:32:17.5225928Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5225998Z 2025-05-07T20:32:17.5226085Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5226090Z 2025-05-07T20:32:17.5226189Z moe/activation_test.py:117: 2025-05-07T20:32:17.5226314Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5226418Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5226516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5227017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5227123Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5227479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5227699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5228041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5228131Z kernel = self.compile( 2025-05-07T20:32:17.5228513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5228686Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5228810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5228815Z 2025-05-07T20:32:17.5229022Z self = 2025-05-07T20:32:17.5229864Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5230380Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2ad4d550>} 2025-05-07T20:32:17.5231131Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5231328Z context = 2025-05-07T20:32:17.5231335Z 2025-05-07T20:32:17.5231497Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5231760Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5231868Z module_map=module_map) 2025-05-07T20:32:17.5232027Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5232123Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5232201Z E ^ 2025-05-07T20:32:17.5232555Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5232562Z 2025-05-07T20:32:17.5232980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5232984Z 2025-05-07T20:32:17.5233085Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5233353Z self=, 2025-05-07T20:32:17.5233432Z T=1, 2025-05-07T20:32:17.5233507Z D=7168, 2025-05-07T20:32:17.5233587Z scale_ub=None, 2025-05-07T20:32:17.5233674Z contiguous=True, 2025-05-07T20:32:17.5233829Z compiled=False, 2025-05-07T20:32:17.5233901Z ) 2025-05-07T20:32:17.5234127Z self = 2025-05-07T20:32:17.5234288Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.5234293Z 2025-05-07T20:32:17.5234372Z @given( 2025-05-07T20:32:17.5234532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5234629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5234745Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5234860Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5234969Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5235049Z ) 2025-05-07T20:32:17.5235292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5235390Z def test_silu_mul_quant( 2025-05-07T20:32:17.5235465Z self, 2025-05-07T20:32:17.5235542Z T: int, 2025-05-07T20:32:17.5235627Z D: int, 2025-05-07T20:32:17.5235726Z scale_ub: Optional[float], 2025-05-07T20:32:17.5235813Z contiguous: bool, 2025-05-07T20:32:17.5235899Z compiled: bool, 2025-05-07T20:32:17.5235975Z ) -> None: 2025-05-07T20:32:17.5236067Z torch.manual_seed(2025) 2025-05-07T20:32:17.5236147Z 2025-05-07T20:32:17.5236312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5236387Z 2025-05-07T20:32:17.5236482Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5236603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5236691Z x = x_sign * x_clamp 2025-05-07T20:32:17.5236773Z x0 = x[:, :D] 2025-05-07T20:32:17.5236853Z x1 = x[:, D:] 2025-05-07T20:32:17.5236926Z 2025-05-07T20:32:17.5237006Z if contiguous: 2025-05-07T20:32:17.5237094Z x0 = x0.contiguous() 2025-05-07T20:32:17.5237182Z x1 = x1.contiguous() 2025-05-07T20:32:17.5237257Z 2025-05-07T20:32:17.5237345Z if scale_ub is not None: 2025-05-07T20:32:17.5237450Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5237582Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5237657Z ) 2025-05-07T20:32:17.5237739Z else: 2025-05-07T20:32:17.5237836Z scale_ub_tensor = None 2025-05-07T20:32:17.5237906Z 2025-05-07T20:32:17.5238037Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5238123Z op = silu_mul_quant 2025-05-07T20:32:17.5238209Z if compiled: 2025-05-07T20:32:17.5238306Z op = torch.compile(op) 2025-05-07T20:32:17.5238409Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5238487Z 2025-05-07T20:32:17.5238574Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5238579Z 2025-05-07T20:32:17.5238673Z moe/activation_test.py:117: 2025-05-07T20:32:17.5238807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5238906Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5239002Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5239506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5239603Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5239962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5240187Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5240530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5240669Z kernel = self.compile( 2025-05-07T20:32:17.5241048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5241299Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5241423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5241428Z 2025-05-07T20:32:17.5241638Z self = 2025-05-07T20:32:17.5242423Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5242976Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2acad160>} 2025-05-07T20:32:17.5243725Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5243921Z context = 2025-05-07T20:32:17.5243926Z 2025-05-07T20:32:17.5244091Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5244352Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5244461Z module_map=module_map) 2025-05-07T20:32:17.5244626Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5244724Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5244802Z E ^ 2025-05-07T20:32:17.5245155Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5245163Z 2025-05-07T20:32:17.5245573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5245583Z 2025-05-07T20:32:17.5245689Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5245907Z self=, 2025-05-07T20:32:17.5245987Z T=16384, 2025-05-07T20:32:17.5246060Z D=7168, 2025-05-07T20:32:17.5246142Z scale_ub=1200.0, 2025-05-07T20:32:17.5246230Z contiguous=False, 2025-05-07T20:32:17.5246315Z compiled=True, 2025-05-07T20:32:17.5246384Z ) 2025-05-07T20:32:17.5246601Z self = 2025-05-07T20:32:17.5246774Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.5246778Z 2025-05-07T20:32:17.5246856Z @given( 2025-05-07T20:32:17.5246973Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5247072Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5247187Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5247300Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5247418Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5247493Z ) 2025-05-07T20:32:17.5247741Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5247830Z def test_silu_mul_quant( 2025-05-07T20:32:17.5247909Z self, 2025-05-07T20:32:17.5247984Z T: int, 2025-05-07T20:32:17.5248064Z D: int, 2025-05-07T20:32:17.5248162Z scale_ub: Optional[float], 2025-05-07T20:32:17.5248248Z contiguous: bool, 2025-05-07T20:32:17.5248336Z compiled: bool, 2025-05-07T20:32:17.5248410Z ) -> None: 2025-05-07T20:32:17.5248502Z torch.manual_seed(2025) 2025-05-07T20:32:17.5248578Z 2025-05-07T20:32:17.5248741Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5248859Z 2025-05-07T20:32:17.5248951Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5249072Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5249230Z x = x_sign * x_clamp 2025-05-07T20:32:17.5249314Z x0 = x[:, :D] 2025-05-07T20:32:17.5249392Z x1 = x[:, D:] 2025-05-07T20:32:17.5249462Z 2025-05-07T20:32:17.5249547Z if contiguous: 2025-05-07T20:32:17.5249636Z x0 = x0.contiguous() 2025-05-07T20:32:17.5249723Z x1 = x1.contiguous() 2025-05-07T20:32:17.5249837Z 2025-05-07T20:32:17.5249925Z if scale_ub is not None: 2025-05-07T20:32:17.5250031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5250162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5250239Z ) 2025-05-07T20:32:17.5250316Z else: 2025-05-07T20:32:17.5250406Z scale_ub_tensor = None 2025-05-07T20:32:17.5250479Z 2025-05-07T20:32:17.5250611Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5250697Z op = silu_mul_quant 2025-05-07T20:32:17.5250779Z if compiled: 2025-05-07T20:32:17.5250883Z op = torch.compile(op) 2025-05-07T20:32:17.5250986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5251055Z 2025-05-07T20:32:17.5251150Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5251155Z 2025-05-07T20:32:17.5251248Z moe/activation_test.py:117: 2025-05-07T20:32:17.5251375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5251477Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5251574Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5251944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5252035Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5252529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5252626Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5252984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5253211Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5253548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5253640Z kernel = self.compile( 2025-05-07T20:32:17.5254020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5254192Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5254322Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5254329Z 2025-05-07T20:32:17.5254531Z self = 2025-05-07T20:32:17.5255318Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5255832Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2acaddc0>} 2025-05-07T20:32:17.5256578Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5256776Z context = 2025-05-07T20:32:17.5256780Z 2025-05-07T20:32:17.5256944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5257251Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5257360Z module_map=module_map) 2025-05-07T20:32:17.5257675Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5257777Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5257851Z E ^ 2025-05-07T20:32:17.5258208Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5258213Z 2025-05-07T20:32:17.5258665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5258669Z 2025-05-07T20:32:17.5258770Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5258992Z self=, 2025-05-07T20:32:17.5259067Z T=1, 2025-05-07T20:32:17.5259147Z D=7168, 2025-05-07T20:32:17.5259230Z scale_ub=None, 2025-05-07T20:32:17.5259314Z contiguous=False, 2025-05-07T20:32:17.5259396Z compiled=False, 2025-05-07T20:32:17.5259469Z ) 2025-05-07T20:32:17.5259689Z self = 2025-05-07T20:32:17.5259856Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.5259861Z 2025-05-07T20:32:17.5259942Z @given( 2025-05-07T20:32:17.5260069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5260181Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5260321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5260435Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5260547Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5260617Z ) 2025-05-07T20:32:17.5260858Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5260955Z def test_silu_mul_quant( 2025-05-07T20:32:17.5261030Z self, 2025-05-07T20:32:17.5261103Z T: int, 2025-05-07T20:32:17.5261183Z D: int, 2025-05-07T20:32:17.5261278Z scale_ub: Optional[float], 2025-05-07T20:32:17.5261368Z contiguous: bool, 2025-05-07T20:32:17.5261456Z compiled: bool, 2025-05-07T20:32:17.5261529Z ) -> None: 2025-05-07T20:32:17.5261625Z torch.manual_seed(2025) 2025-05-07T20:32:17.5261695Z 2025-05-07T20:32:17.5261860Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5261937Z 2025-05-07T20:32:17.5262025Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5262146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5262234Z x = x_sign * x_clamp 2025-05-07T20:32:17.5262315Z x0 = x[:, :D] 2025-05-07T20:32:17.5262391Z x1 = x[:, D:] 2025-05-07T20:32:17.5262466Z 2025-05-07T20:32:17.5262546Z if contiguous: 2025-05-07T20:32:17.5262636Z x0 = x0.contiguous() 2025-05-07T20:32:17.5262727Z x1 = x1.contiguous() 2025-05-07T20:32:17.5262796Z 2025-05-07T20:32:17.5262885Z if scale_ub is not None: 2025-05-07T20:32:17.5262990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5263125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5263203Z ) 2025-05-07T20:32:17.5263277Z else: 2025-05-07T20:32:17.5263365Z scale_ub_tensor = None 2025-05-07T20:32:17.5263438Z 2025-05-07T20:32:17.5263566Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5263658Z op = silu_mul_quant 2025-05-07T20:32:17.5263741Z if compiled: 2025-05-07T20:32:17.5263837Z op = torch.compile(op) 2025-05-07T20:32:17.5263940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5264013Z 2025-05-07T20:32:17.5264100Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5264153Z 2025-05-07T20:32:17.5264249Z moe/activation_test.py:117: 2025-05-07T20:32:17.5264372Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5264471Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5264642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5265141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5265238Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5265598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5265860Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5266200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5266291Z kernel = self.compile( 2025-05-07T20:32:17.5266667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5266845Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5266971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5266975Z 2025-05-07T20:32:17.5267179Z self = 2025-05-07T20:32:17.5267964Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5268472Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2ad67790>} 2025-05-07T20:32:17.5269239Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5269433Z context = 2025-05-07T20:32:17.5269437Z 2025-05-07T20:32:17.5269607Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5269922Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5270026Z module_map=module_map) 2025-05-07T20:32:17.5270193Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5270295Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5270370Z E ^ 2025-05-07T20:32:17.5270728Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5270733Z 2025-05-07T20:32:17.5271141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5271148Z 2025-05-07T20:32:17.5271250Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5271469Z self=, 2025-05-07T20:32:17.5271546Z T=2048, 2025-05-07T20:32:17.5271622Z D=7168, 2025-05-07T20:32:17.5271700Z scale_ub=None, 2025-05-07T20:32:17.5271783Z contiguous=False, 2025-05-07T20:32:17.5271864Z compiled=True, 2025-05-07T20:32:17.5271934Z ) 2025-05-07T20:32:17.5272149Z self = 2025-05-07T20:32:17.5272322Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.5272327Z 2025-05-07T20:32:17.5272400Z @given( 2025-05-07T20:32:17.5272518Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5272616Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5272731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5272900Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5273012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5273081Z ) 2025-05-07T20:32:17.5273401Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5273492Z def test_silu_mul_quant( 2025-05-07T20:32:17.5273569Z self, 2025-05-07T20:32:17.5273642Z T: int, 2025-05-07T20:32:17.5273714Z D: int, 2025-05-07T20:32:17.5273812Z scale_ub: Optional[float], 2025-05-07T20:32:17.5273898Z contiguous: bool, 2025-05-07T20:32:17.5274023Z compiled: bool, 2025-05-07T20:32:17.5274098Z ) -> None: 2025-05-07T20:32:17.5274192Z torch.manual_seed(2025) 2025-05-07T20:32:17.5274260Z 2025-05-07T20:32:17.5274429Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5274499Z 2025-05-07T20:32:17.5274586Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5274711Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5274795Z x = x_sign * x_clamp 2025-05-07T20:32:17.5274875Z x0 = x[:, :D] 2025-05-07T20:32:17.5274951Z x1 = x[:, D:] 2025-05-07T20:32:17.5275019Z 2025-05-07T20:32:17.5275108Z if contiguous: 2025-05-07T20:32:17.5275197Z x0 = x0.contiguous() 2025-05-07T20:32:17.5275283Z x1 = x1.contiguous() 2025-05-07T20:32:17.5275355Z 2025-05-07T20:32:17.5275442Z if scale_ub is not None: 2025-05-07T20:32:17.5275544Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5275683Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5275756Z ) 2025-05-07T20:32:17.5275829Z else: 2025-05-07T20:32:17.5275923Z scale_ub_tensor = None 2025-05-07T20:32:17.5275993Z 2025-05-07T20:32:17.5276119Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5276215Z op = silu_mul_quant 2025-05-07T20:32:17.5276295Z if compiled: 2025-05-07T20:32:17.5276393Z op = torch.compile(op) 2025-05-07T20:32:17.5276496Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5276564Z 2025-05-07T20:32:17.5276662Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5276666Z 2025-05-07T20:32:17.5276760Z moe/activation_test.py:117: 2025-05-07T20:32:17.5276882Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5276989Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5277086Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5277459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5277549Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5278041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5278142Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5278496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5278726Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5279061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5279150Z kernel = self.compile( 2025-05-07T20:32:17.5279532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5279707Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5279828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5279833Z 2025-05-07T20:32:17.5280038Z self = 2025-05-07T20:32:17.5280816Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5281447Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2acf8430>} 2025-05-07T20:32:17.5282202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5282434Z context = 2025-05-07T20:32:17.5282442Z 2025-05-07T20:32:17.5282603Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5282863Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5282974Z module_map=module_map) 2025-05-07T20:32:17.5283133Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5283229Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5283306Z E ^ 2025-05-07T20:32:17.5283665Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5283670Z 2025-05-07T20:32:17.5284080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5284088Z 2025-05-07T20:32:17.5284188Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5284406Z self=, 2025-05-07T20:32:17.5284481Z T=4096, 2025-05-07T20:32:17.5284553Z D=7168, 2025-05-07T20:32:17.5284630Z scale_ub=None, 2025-05-07T20:32:17.5284715Z contiguous=False, 2025-05-07T20:32:17.5284799Z compiled=True, 2025-05-07T20:32:17.5284868Z ) 2025-05-07T20:32:17.5285090Z self = 2025-05-07T20:32:17.5285263Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.5285268Z 2025-05-07T20:32:17.5285358Z @given( 2025-05-07T20:32:17.5285476Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5285576Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5285702Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5285819Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5285934Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5286015Z ) 2025-05-07T20:32:17.5286258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5286350Z def test_silu_mul_quant( 2025-05-07T20:32:17.5286433Z self, 2025-05-07T20:32:17.5286508Z T: int, 2025-05-07T20:32:17.5286593Z D: int, 2025-05-07T20:32:17.5286690Z scale_ub: Optional[float], 2025-05-07T20:32:17.5286777Z contiguous: bool, 2025-05-07T20:32:17.5286867Z compiled: bool, 2025-05-07T20:32:17.5286944Z ) -> None: 2025-05-07T20:32:17.5287041Z torch.manual_seed(2025) 2025-05-07T20:32:17.5287121Z 2025-05-07T20:32:17.5287288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5287362Z 2025-05-07T20:32:17.5287462Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5287585Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5287676Z x = x_sign * x_clamp 2025-05-07T20:32:17.5287763Z x0 = x[:, :D] 2025-05-07T20:32:17.5287844Z x1 = x[:, D:] 2025-05-07T20:32:17.5287924Z 2025-05-07T20:32:17.5288007Z if contiguous: 2025-05-07T20:32:17.5288097Z x0 = x0.contiguous() 2025-05-07T20:32:17.5288193Z x1 = x1.contiguous() 2025-05-07T20:32:17.5288337Z 2025-05-07T20:32:17.5288426Z if scale_ub is not None: 2025-05-07T20:32:17.5288536Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5288673Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5288749Z ) 2025-05-07T20:32:17.5289445Z else: 2025-05-07T20:32:17.5289543Z scale_ub_tensor = None 2025-05-07T20:32:17.5289614Z 2025-05-07T20:32:17.5289749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5289840Z op = silu_mul_quant 2025-05-07T20:32:17.5289924Z if compiled: 2025-05-07T20:32:17.5290071Z op = torch.compile(op) 2025-05-07T20:32:17.5290178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5290256Z 2025-05-07T20:32:17.5290343Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5290348Z 2025-05-07T20:32:17.5290441Z moe/activation_test.py:117: 2025-05-07T20:32:17.5290573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5290676Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5290775Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5291153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5291244Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5291742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.5291838Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.5292199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:17.5292430Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:17.5292766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:17.5292858Z     kernel = self.compile(
2025-05-07T20:32:17.5293243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.5293418Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.5293553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.5294549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:17.5296191Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:17.5296457Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:17.5296563Z                           module_map=module_map)
2025-05-07T20:32:17.5296731Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.5296831Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.5296907Z E       ^
2025-05-07T20:32:17.5297262Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.5297728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.5297835Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:17.5299059Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:17.5299147Z     @given(
2025-05-07T20:32:17.5299263Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:17.5299368Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:17.5299486Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:17.5299603Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:17.5299720Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:17.5299793Z     )
2025-05-07T20:32:17.5300042Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:17.5300142Z     def test_silu_mul_quant(
2025-05-07T20:32:17.5300218Z         self,
2025-05-07T20:32:17.5300294Z         T: int,
2025-05-07T20:32:17.5300375Z         D: int,
2025-05-07T20:32:17.5300472Z         scale_ub: Optional[float],
2025-05-07T20:32:17.5300567Z         contiguous: bool,
2025-05-07T20:32:17.5300652Z         compiled: bool,
2025-05-07T20:32:17.5300730Z     ) -> None:
2025-05-07T20:32:17.5300828Z         torch.manual_seed(2025)
2025-05-07T20:32:17.5304826Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:17.5305006Z         x_sign = torch.sign(x)
2025-05-07T20:32:17.5305132Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:17.5305216Z         x = x_sign * x_clamp
2025-05-07T20:32:17.5305296Z         x0 = x[:, :D]
2025-05-07T20:32:17.5305378Z         x1 = x[:, D:]
2025-05-07T20:32:17.5305528Z         if contiguous:
2025-05-07T20:32:17.5305614Z             x0 = x0.contiguous()
2025-05-07T20:32:17.5305696Z             x1 = x1.contiguous()
2025-05-07T20:32:17.5305851Z         if scale_ub is not None:
2025-05-07T20:32:17.5305956Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:17.5306090Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:17.5306160Z             )
2025-05-07T20:32:17.5306231Z         else:
2025-05-07T20:32:17.5306324Z             scale_ub_tensor = None
2025-05-07T20:32:17.5306520Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:17.5306605Z             op = silu_mul_quant
2025-05-07T20:32:17.5306686Z             if compiled:
2025-05-07T20:32:17.5306785Z                 op = torch.compile(op)
2025-05-07T20:32:17.5306886Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.5307044Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:17.5307145Z moe/activation_test.py:117:
2025-05-07T20:32:17.5307270Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.5307368Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.5307466Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.5307974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.5308068Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.5308424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:17.5308768Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:17.5309213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:17.5309303Z     kernel = self.compile(
2025-05-07T20:32:17.5309680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.5309916Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.5310121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.5313323Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:17.5313632Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:17.5313742Z                           module_map=module_map)
2025-05-07T20:32:17.5313917Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.5314022Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.5314097Z E       ^
2025-05-07T20:32:17.5314523Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.5315036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
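Every failure in this run has the same root cause: the kernel asks Triton for a conversion to fp8e4nv (Triton's name for FP8 E4M3), but the error lists only 'fp8e4b15' and 'fp8e5' as supported, which is what Triton reports on NVIDIA GPUs older than compute capability 8.9 (Ada) / 9.0 (Hopper). A minimal guard along the following lines would skip rather than fail the test on such hardware; this is a sketch, and the helper name fp8_supported and the skip wiring are illustrative, not FBGEMM's actual code:

    import unittest
    import torch

    def fp8_supported() -> bool:
        """True if this GPU can run Triton fp8e4nv (FP8 E4M3) kernels."""
        if not torch.cuda.is_available():
            return False
        # Ada (8.9) and Hopper (9.0) introduced hardware FP8 E4M3 support;
        # older parts only expose fp8e4b15 / fp8e5 in Triton.
        return torch.cuda.get_device_capability() >= (8, 9)

    # Illustrative usage on the test shown in this log:
    # @unittest.skipUnless(fp8_supported(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...): ...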
Hypothesis then retried the test; every example below reached the same _fbgemm_silu_mul_quant kernel launch and failed with the identical CompilationError (ValueError: "type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). For the compiled=True examples the call additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching the kernel; the test source and traceback are otherwise identical to the one above:

2025-05-07T20:32:17.5315145Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:17.5328251Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:17.5340873Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:17.5353712Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:17.5366464Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:17.5379015Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:17.5391958Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:17.5404762Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:17.5417158Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:17.5432668Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
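From the test body repeated above, silu_mul_quant evidently fuses a SiLU-gated multiply with FP8 quantization: it takes two bfloat16 halves plus an optional scale_ub tensor and returns a quantized tensor and its scale. A rough eager-mode sketch of those assumed semantics follows; the function name silu_mul_quant_reference and the row-wise scaling scheme are guesses for illustration, since the real op is the fused Triton kernel _fbgemm_silu_mul_quant and may differ in detail:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU-gated multiply in float32, then quantize to FP8 E4M3.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Per-row absolute maximum determines the dequantization scale.
        row_max = y.abs().amax(dim=-1, keepdim=True)
        if scale_ub is not None:
            # Optionally clamp the scale from above, as scale_ub_tensor
            # does in the test.
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale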
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test source and Triton traceback identical to the example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[test source and Triton traceback identical to the example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
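The "Trying example" blocks are Hypothesis's Verbosity.verbose output: each block is one draw from the sampled_from strategies in the @given decorator, and the test body re-runs per draw (max_examples, here the test's own _MAX_SAMPLES constant, caps the number of draws). A self-contained sketch of the same harness shape, with a toy property in place of the CUDA test:

    from hypothesis import Verbosity, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_shapes(T: int, D: int) -> None:
        # Toy property; the real test allocates [T, 2 * D] CUDA tensors.
        assert T * 2 * D > 0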
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test source identical to the example above, failing earlier, in the input setup]
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[test source identical to the example above]
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
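The allocator hint in these messages refers to a knob of PyTorch's caching allocator; it is read when CUDA is first initialized, so it must be set in the environment before that point. A minimal sketch of applying the suggestion (whether it helps depends on whether the OOMs are fragmentation or plain exhaustion):

    import os

    # Must be set before torch initializes CUDA, ideally before the import.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    x = torch.empty(1024, device="cuda")  # allocations now use expandable segments

In CI this is usually exported in the workflow environment rather than set in Python.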
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (140.44 MiB free; 21.50 GiB allocated by PyTorch; 141.02 MiB reserved but unallocated). [allocator advice identical to the messages above]

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch; 85.02 MiB reserved but unallocated). [allocator advice identical to the messages above]

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch; 85.02 MiB reserved but unallocated). [allocator advice identical to the messages above]

moe/activation_test.py:94: OutOfMemoryError
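Note that the last two failures die on elementwise temporaries (torch.sign, torch.clamp) of only 56 MiB, not on the initial torch.randn. As a sketch, the same preprocessing can be written with one temporary instead of three by mutating x in place; this is an illustrative rewrite, not the test's code:

    import torch

    def preprocess_inplace(x: torch.Tensor) -> torch.Tensor:
        # sign(x) * clamp(|x|, 0.01, 2.0), as in the test, but in place:
        # only `sign` is a fresh allocation; x itself is overwritten.
        sign = torch.sign(x)
        return x.abs_().clamp_(0.01, 2.0).mul_(sign)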
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source and Triton traceback identical to the first example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
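This example hits the same CompilationError even though compiled=False: the traceback goes from activation.py:80 straight into triton/runtime/jit.py, because subscripting a @triton.jit function with a grid and calling it is what triggers Triton's own JIT compile, independently of torch.compile. A toy illustration of that launch pattern (not the FBGEMM kernel):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _toy_copy(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)

    x = torch.arange(8, device="cuda", dtype=torch.float32)
    y = torch.empty_like(x)
    # kernel[grid](...) compiles on first use -- the point where the
    # fp8e4nv ValueError above is raised for _fbgemm_silu_mul_quant.
    _toy_copy[(1,)](x, y, x.numel(), BLOCK=8)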
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source and Triton traceback identical to the first example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source and Triton traceback identical to the first example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free; 21.69 GiB allocated by PyTorch; 59.18 MiB reserved but unallocated). [allocator advice identical to the messages above]

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source and Triton traceback identical to the first example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5558548Z 2025-05-07T20:32:17.5558956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5558961Z 2025-05-07T20:32:17.5559110Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5559334Z self=, 2025-05-07T20:32:17.5559408Z T=2048, 2025-05-07T20:32:17.5559487Z D=5120, 2025-05-07T20:32:17.5559564Z scale_ub=None, 2025-05-07T20:32:17.5559645Z contiguous=True, 2025-05-07T20:32:17.5559745Z compiled=False, 2025-05-07T20:32:17.5559823Z ) 2025-05-07T20:32:17.5560061Z self = 2025-05-07T20:32:17.5560228Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.5560232Z 2025-05-07T20:32:17.5560305Z @given( 2025-05-07T20:32:17.5560428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5560524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5560637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5560753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5560870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5560940Z ) 2025-05-07T20:32:17.5561181Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5561271Z def test_silu_mul_quant( 2025-05-07T20:32:17.5561348Z self, 2025-05-07T20:32:17.5561418Z T: int, 2025-05-07T20:32:17.5561489Z D: int, 2025-05-07T20:32:17.5561589Z scale_ub: Optional[float], 2025-05-07T20:32:17.5561675Z contiguous: bool, 2025-05-07T20:32:17.5561757Z compiled: bool, 2025-05-07T20:32:17.5561836Z ) -> None: 2025-05-07T20:32:17.5561927Z torch.manual_seed(2025) 2025-05-07T20:32:17.5561999Z 2025-05-07T20:32:17.5562167Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5562238Z 2025-05-07T20:32:17.5562328Z > x_sign = torch.sign(x) 2025-05-07T20:32:17.5564148Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.5564202Z 2025-05-07T20:32:17.5564320Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:17.5564325Z 2025-05-07T20:32:17.5564423Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5564684Z self=, 2025-05-07T20:32:17.5564758Z T=16384, 2025-05-07T20:32:17.5564829Z D=5120, 2025-05-07T20:32:17.5564908Z scale_ub=None, 2025-05-07T20:32:17.5564994Z contiguous=True, 2025-05-07T20:32:17.5565074Z compiled=False, 2025-05-07T20:32:17.5565208Z ) 2025-05-07T20:32:17.5565420Z self = 2025-05-07T20:32:17.5565588Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.5565593Z 2025-05-07T20:32:17.5565668Z @given( 2025-05-07T20:32:17.5565781Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5565878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5565988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5566098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5566207Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5566281Z ) 2025-05-07T20:32:17.5566519Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5566609Z def test_silu_mul_quant( 2025-05-07T20:32:17.5566683Z self, 2025-05-07T20:32:17.5566755Z T: int, 2025-05-07T20:32:17.5566878Z D: int, 2025-05-07T20:32:17.5566972Z scale_ub: Optional[float], 2025-05-07T20:32:17.5567055Z contiguous: bool, 2025-05-07T20:32:17.5567139Z compiled: bool, 2025-05-07T20:32:17.5567213Z ) -> None: 2025-05-07T20:32:17.5567304Z torch.manual_seed(2025) 2025-05-07T20:32:17.5567375Z 2025-05-07T20:32:17.5567537Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5569323Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.5569332Z 2025-05-07T20:32:17.5569446Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.5569450Z 2025-05-07T20:32:17.5569551Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5569776Z self=, 2025-05-07T20:32:17.5569846Z T=4096, 2025-05-07T20:32:17.5569941Z D=5120, 2025-05-07T20:32:17.5570024Z scale_ub=None, 2025-05-07T20:32:17.5570119Z contiguous=True, 2025-05-07T20:32:17.5570210Z compiled=False, 2025-05-07T20:32:17.5570277Z ) 2025-05-07T20:32:17.5570485Z self = 2025-05-07T20:32:17.5570655Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.5570659Z 2025-05-07T20:32:17.5570732Z @given( 2025-05-07T20:32:17.5570848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5570942Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5571059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5571175Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5571284Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5571352Z ) 2025-05-07T20:32:17.5571596Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5571734Z def test_silu_mul_quant( 2025-05-07T20:32:17.5571809Z self, 2025-05-07T20:32:17.5571879Z T: int, 2025-05-07T20:32:17.5571949Z D: int, 2025-05-07T20:32:17.5572047Z scale_ub: Optional[float], 2025-05-07T20:32:17.5572133Z contiguous: bool, 2025-05-07T20:32:17.5572252Z compiled: bool, 2025-05-07T20:32:17.5572334Z ) -> None: 2025-05-07T20:32:17.5572424Z torch.manual_seed(2025) 2025-05-07T20:32:17.5572489Z 2025-05-07T20:32:17.5572655Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5574422Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
Hypothesis then retries eight more examples, and every one fails with the same torch.OutOfMemoryError at moe/activation_test.py:92 (the initial torch.randn), with GPU 0 reporting the same state each time: 22.07 GiB total capacity, 26.44 MiB free, 21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated, and the same PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint. Only the sampled parameters and the requested allocation differ:

    T      D     scale_ub  contiguous  compiled  Tried to allocate
    2048   5120  None      False       False      40.00 MiB
    4096   7168  None      True        True      112.00 MiB
    2048   5120  1200.0    False       False      40.00 MiB
    4096   7168  1200.0    True        False     112.00 MiB
    16384  7168  None      False       True      448.00 MiB
    4096   7168  None      True        False     112.00 MiB
    16384  7168  None      True        False     448.00 MiB
    16384  7168  1200.0    True        False     448.00 MiB
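That even a 40.00 MiB request fails while 21.73 GiB stays allocated by PyTorch points at memory surviving from earlier examples (or earlier tests in the same process), not at any single example being too large for the 22 GiB card. A minimal sketch of one common mitigation, assuming a unittest-style test class like the one above (the tearDown hook is unittest's convention, not necessarily something this test defines):

    import gc

    import torch

    def free_cuda_memory() -> None:
        torch.cuda.synchronize()   # let in-flight kernels finish
        gc.collect()               # drop Python references to dead tensors
        torch.cuda.empty_cache()   # hand cached free blocks back to the driver

    # e.g. in the test class:
    # def tearDown(self) -> None:
    #     free_cuda_memory()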
The next example is small enough to allocate, and exposes a different failure:

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False) fails with the same torch.OutOfMemoryError at moe/activation_test.py:92 (56.00 MiB requested; 21.74 GiB allocated by PyTorch, 10.99 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) gets past its allocations and reaches the identical CompilationError through the torch.compile path; the traceback only adds
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
before entering fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant) and failing in the Triton compiler with:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
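fp8e4nv is Triton's name for NVIDIA's FP8 E4M3 format (PyTorch's float8_e4m3fn). Triton only emits it for GPUs of compute capability 8.9 or newer (Ada/Hopper); on older parts it offers only fp8e4b15 and fp8e5, exactly as the ValueError reports. The g5.4xlarge runner carries an A10G, which is compute capability 8.6, so every code path that materializes an fp8e4nv tensor fails to compile here regardless of shape or torch.compile. A minimal guard sketch, assuming a unittest-based suite like the one above (the decorator placement is illustrative, not FBGEMM's actual code):

    import unittest

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv (FP8 E4M3) needs SM >= 8.9; the A10G on this runner is SM 8.6.
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # @unittest.skipUnless(fp8e4nv_supported(), "Triton fp8e4nv requires SM >= 8.9")
    # def test_silu_mul_quant(self, ...) -> None: ...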
Three final examples then run out of memory even at T=128; by this point GPU 0 has only 4.44 MiB free, with 21.77 GiB allocated by PyTorch and a few MiB reserved but unallocated:

    T    D     scale_ub  contiguous  compiled  Fails at                   Tried to allocate
    128  7168  1200.0    True        False     moe/activation_test.py:95   20.00 MiB
    128  5120  1200.0    True        True      moe/activation_test.py:95   20.00 MiB
    128  7168  None      True        True      moe/activation_test.py:92   20.00 MiB

Line 92 is the torch.randn call and line 95 the torch.clamp call; each raises the same torch.OutOfMemoryError with the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint.

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 1 failed, 1 deselected, 3 warnings in 24.11s =================
ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error)
[EXEC] [ATTEMPT 0/2] Command attempt failed.
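The retry wrapper now reruns the suite, but the failure no longer needs Hypothesis or the retry machinery to reproduce. A minimal standalone sketch distilled from the failing example above (T=128, D=5120, scale_ub=1200.0); the import path and call shape are taken from the log's traceback, and on this SM 8.6 runner it should raise the same triton.compiler.errors.CompilationError:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    # Expected on this runner: CompilationError("type fp8e4nv not supported ...")
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)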
See " 2025-05-07T20:32:17.5664718Z 2025-05-07T20:32:17.5664926Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:17.5665088Z ================= 1 failed, 1 deselected, 3 warnings in 24.11s ================= 2025-05-07T20:32:19.1384508Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:19.1999846Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:19.2000196Z 2025-05-07T20:32:21.2016976Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:23.3700032Z ============================= test session starts ============================== 2025-05-07T20:32:23.3700734Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:23.3701291Z cachedir: .pytest_cache 2025-05-07T20:32:23.3701896Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:23.3702646Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:23.3703058Z plugins: hypothesis-6.131.14 2025-05-07T20:32:24.9668048Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:25.1806184Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:25.1806961Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:25.1807381Z 2025-05-07T20:32:27.8484208Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.8485039Z self=, 2025-05-07T20:32:27.8485456Z T=1, 2025-05-07T20:32:27.8485645Z D=5120, 2025-05-07T20:32:27.8485846Z scale_ub=None, 2025-05-07T20:32:27.8486065Z contiguous=True, 2025-05-07T20:32:27.8486291Z compiled=True, 2025-05-07T20:32:27.8486790Z ) 2025-05-07T20:32:27.8487114Z self = 2025-05-07T20:32:27.8487596Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:27.8487866Z 2025-05-07T20:32:27.8487945Z @given( 2025-05-07T20:32:27.8488277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.8488601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.8488906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.8489242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.8489655Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.8489938Z ) 2025-05-07T20:32:27.8490297Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.8490743Z def test_silu_mul_quant( 2025-05-07T20:32:27.8490980Z self, 2025-05-07T20:32:27.8491181Z T: int, 2025-05-07T20:32:27.8491381Z D: int, 2025-05-07T20:32:27.8491601Z scale_ub: Optional[float], 2025-05-07T20:32:27.8491882Z contiguous: bool, 2025-05-07T20:32:27.8492126Z compiled: bool, 2025-05-07T20:32:27.8492350Z ) -> None: 2025-05-07T20:32:27.8492568Z torch.manual_seed(2025) 2025-05-07T20:32:27.8492817Z 2025-05-07T20:32:27.8493084Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.8493430Z 2025-05-07T20:32:27.8493625Z x_sign = torch.sign(x) 2025-05-07T20:32:27.8493918Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
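The rerun makes the scope of the problem clearer: not only FBGEMM's _fbgemm_silu_mul_quant but also the reference path's _kernel_quantize_fp8_row (launched by triton_quantize_fp8_row in fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py) targets fp8e4nv, so even the eager reference cannot run on this GPU. A hedged sketch of a Triton-free, pure-PyTorch row-wise FP8 reference follows; quantize_fp8_row_ref is a hypothetical helper, the scale semantics are inferred from the test's dequantization (y_fp8.to(torch.float32) * y_scale[:, None]), and the scale_ub handling is a guess rather than FBGEMM's documented behavior:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale so each row's max magnitude maps to the FP8 max value.
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

PyTorch's float8_e4m3fn casts are emulated in software, so a reference like this runs on SM 8.6 even though the hardware has no FP8 units.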
2025-05-07T20:32:29.3640242Z x_sign = torch.sign(x) 2025-05-07T20:32:29.3640628Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.3640947Z x = x_sign * x_clamp 2025-05-07T20:32:29.3641186Z x0 = x[:, :D] 2025-05-07T20:32:29.3641406Z x1 = x[:, D:] 2025-05-07T20:32:29.3641619Z 2025-05-07T20:32:29.3641802Z if contiguous: 2025-05-07T20:32:29.3642041Z x0 = x0.contiguous() 2025-05-07T20:32:29.3642315Z x1 = x1.contiguous() 2025-05-07T20:32:29.3642558Z 2025-05-07T20:32:29.3642755Z if scale_ub is not None: 2025-05-07T20:32:29.3643069Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.3643429Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.3643744Z ) 2025-05-07T20:32:29.3643934Z else: 2025-05-07T20:32:29.3644146Z scale_ub_tensor = None 2025-05-07T20:32:29.3644399Z 2025-05-07T20:32:29.3644627Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.3644949Z op = silu_mul_quant 2025-05-07T20:32:29.3645210Z if compiled: 2025-05-07T20:32:29.3645466Z op = torch.compile(op) 2025-05-07T20:32:29.3645761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3646041Z 2025-05-07T20:32:29.3646238Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.3646403Z 2025-05-07T20:32:29.3646503Z moe/activation_test.py:117: 2025-05-07T20:32:29.3646810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3647146Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.3647421Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3648116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.3648813Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.3649358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.3650042Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.3650741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.3651269Z kernel = self.compile( 2025-05-07T20:32:29.3651812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.3652550Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.3652942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3653178Z 2025-05-07T20:32:29.3653461Z self = 2025-05-07T20:32:29.3654551Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.3655991Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9fb83b05e0>} 2025-05-07T20:32:29.3657334Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.3658354Z context = 2025-05-07T20:32:29.3658649Z 2025-05-07T20:32:29.3658814Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.3659340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.3659809Z module_map=module_map) 2025-05-07T20:32:29.3660173Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.3660576Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.3660837Z E ^ 2025-05-07T20:32:29.3661294Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.3661751Z 2025-05-07T20:32:29.3662163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.3662715Z 2025-05-07T20:32:29.3662838Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.3663252Z self=, 2025-05-07T20:32:29.3663649Z T=2048, 2025-05-07T20:32:29.3663838Z D=5120, 2025-05-07T20:32:29.3664035Z scale_ub=1200.0, 2025-05-07T20:32:29.3664256Z contiguous=True, 2025-05-07T20:32:29.3664484Z compiled=True, 2025-05-07T20:32:29.3664702Z ) 2025-05-07T20:32:29.3665019Z self = 2025-05-07T20:32:29.3665521Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.3665793Z 2025-05-07T20:32:29.3665879Z @given( 2025-05-07T20:32:29.3666122Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.3666429Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.3666737Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.3667072Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.3667402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.3667692Z ) 2025-05-07T20:32:29.3668045Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.3668491Z def test_silu_mul_quant( 2025-05-07T20:32:29.3668737Z self, 2025-05-07T20:32:29.3668935Z T: int, 2025-05-07T20:32:29.3669133Z D: int, 2025-05-07T20:32:29.3669356Z scale_ub: Optional[float], 2025-05-07T20:32:29.3669635Z contiguous: bool, 2025-05-07T20:32:29.3669982Z compiled: bool, 2025-05-07T20:32:29.3670214Z ) -> None: 2025-05-07T20:32:29.3670430Z torch.manual_seed(2025) 2025-05-07T20:32:29.3670677Z 2025-05-07T20:32:29.3670942Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.3671283Z 2025-05-07T20:32:29.3671477Z x_sign = torch.sign(x) 2025-05-07T20:32:29.3671760Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.3672122Z x = x_sign * x_clamp 2025-05-07T20:32:29.3672364Z x0 = x[:, :D] 2025-05-07T20:32:29.3672574Z x1 = x[:, D:] 2025-05-07T20:32:29.3672789Z 2025-05-07T20:32:29.3673006Z if contiguous: 2025-05-07T20:32:29.3673294Z x0 = x0.contiguous() 2025-05-07T20:32:29.3673562Z x1 = x1.contiguous() 2025-05-07T20:32:29.3673806Z 2025-05-07T20:32:29.3673991Z if scale_ub is not None: 2025-05-07T20:32:29.3674271Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.3674622Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.3674972Z ) 2025-05-07T20:32:29.3675171Z else: 2025-05-07T20:32:29.3675386Z scale_ub_tensor = None 2025-05-07T20:32:29.3675632Z 2025-05-07T20:32:29.3675872Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.3676191Z op = silu_mul_quant 2025-05-07T20:32:29.3676451Z if compiled: 
2025-05-07T20:32:29.3676697Z op = torch.compile(op) 2025-05-07T20:32:29.3676996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3677276Z 2025-05-07T20:32:29.3677465Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.3677765Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.3678061Z 2025-05-07T20:32:29.3678298Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.3678635Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.3678931Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.3679288Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.3679654Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.3679968Z 2025-05-07T20:32:29.3680172Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:29.3680371Z 2025-05-07T20:32:29.3680471Z moe/activation_test.py:126: 2025-05-07T20:32:29.3680780Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3681128Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.3681460Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.3682270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.3683046Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.3683607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.3684295Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.3684980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.3685706Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.3686450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:29.3687203Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.3687938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.3688589Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.3689191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.3689719Z fn() 2025-05-07T20:32:29.3690224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.3690801Z self.fn.run( 2025-05-07T20:32:29.3691261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.3691792Z kernel = self.compile( 2025-05-07T20:32:29.3692507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.3693311Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.3693858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3694153Z 2025-05-07T20:32:29.3694414Z self = 2025-05-07T20:32:29.3695581Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True)
2025-05-07T20:32:29.3696999Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fcac53550>}
2025-05-07T20:32:29.3698358Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:29.3699389Z context =
2025-05-07T20:32:29.3699680Z
2025-05-07T20:32:29.3699851Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:29.3700376Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:29.3700838Z                           module_map=module_map)
2025-05-07T20:32:29.3701250Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:29.3701610Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:29.3701877Z E       ^
2025-05-07T20:32:29.3702406Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:29.3702969Z
2025-05-07T20:32:29.3703500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
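Every trial in this run dies at the same point: Triton refuses to emit IR for the fp8e4nv type, which is the e4m3 format that recent PyTorch exposes as torch.float8_e4m3fn. The usual cutoff is NVIDIA compute capability 8.9 (Ada/Hopper); on older parts the backend offers only 'fp8e4b15' and 'fp8e5', exactly as the ValueError says. A minimal sketch of a guard built on that assumption (the 8.9 cutoff and the helper name are mine, not FBGEMM's):

    import torch

    def fp8_dtype_for_device() -> torch.dtype:
        # Assumption: fp8e4nv (torch.float8_e4m3fn) only compiles on
        # compute capability >= 8.9; fp8e5 (torch.float8_e5m2) is the
        # fallback this architecture does support per the error above.
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2

Gating the test, or the kernel's output dtype, on a check like this would turn these repeated compile failures into a clean skip or a supported fallback.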
(The remaining Hypothesis trials repeat the identical test body and traceback and fail in the same make_ir call with the same ValueError; only the parameters tried and the kernel that failed to compile are listed below.)
2025-05-07T20:32:29.3704585Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn() at moe/activation_test.py:117)
2025-05-07T20:32:30.7083293Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn() at moe/activation_test.py:126)
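Both failing kernels implement the recipe the test body spells out: fn() reaches the fused _fbgemm_silu_mul_quant, while ref_fn() reaches _kernel_quantize_fp8_row through triton_quantize_fp8_row after computing SiLU(x0) * x1 in fp32. A rough sketch of the row-wise quantization step, assuming an e4m3 max magnitude of 448 and treating scale_ub as a cap on the per-row maximum (both are assumptions about the kernel's semantics, and rowwise_fp8_quant_sketch is a hypothetical name, not FBGEMM's code):

    import torch

    FP8_MAX = 448.0  # assumed e4m3fn max representable magnitude

    def rowwise_fp8_quant_sketch(y: torch.Tensor, scale_ub: torch.Tensor = None):
        # One symmetric scale per row so each row fills the fp8 range.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        y_q = torch.clamp(y / scale[:, None], -FP8_MAX, FP8_MAX)
        return y_q.to(torch.float8_e4m3fn), scale

This matches the dequantization the test performs afterwards, y_fp8.to(torch.float32) * y_scale[:, None].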
2025-05-07T20:32:30.7123892Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn() at moe/activation_test.py:117)
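The contiguous parameter matters because x0 = x[:, :D] and x1 = x[:, D:] are strided views into one buffer, and .contiguous() materializes dense copies, so the kernel sees two different memory layouts. A quick CPU-only illustration with toy shapes:

    import torch

    x = torch.randn(4, 8)
    x0 = x[:, :4]                 # view: size (4, 4), stride (8, 1)
    print(x0.is_contiguous())     # False: rows are 8 elements apart
    x0c = x0.contiguous()         # dense copy with stride (4, 1)
    print(x0c.is_contiguous())    # True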
2025-05-07T20:32:32.4785267Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn() at moe/activation_test.py:117)
2025-05-07T20:32:32.4816359Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn() at moe/activation_test.py:126)
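One pattern stands out across the trials: every compiled=False example fails inside fn(), compiling _fbgemm_silu_mul_quant, while every compiled=True example gets past fn() and only fails later in the eager ref_fn(). Evidently the torch.compile path avoids or defers the custom Triton kernel on this GPU, while the eager reference quantizer still hits it. The toggle itself is the usual wrapper pattern; a minimal sketch (maybe_compile is a hypothetical helper name):

    import torch

    def maybe_compile(op, compiled: bool):
        # torch.compile defers all work to the first call, so any
        # kernel compilation error would surface inside the wrapped
        # call rather than here.
        return torch.compile(op) if compiled else op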
2025-05-07T20:32:32.5627301Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn() at moe/activation_test.py:117)
2025-05-07T20:32:32.9642105Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn() at moe/activation_test.py:117)
2025-05-07T20:32:32.9673049Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn() at moe/activation_test.py:126)
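All of these parameter combinations come from the fixed grid in the @given decorator, and @settings(max_examples=_MAX_SAMPLES) bounds how many of them Hypothesis actually tries, which is why only a subset appears in this log. A quick check of the grid size, with the values copied from the decorator:

    from itertools import product

    grid = list(product(
        [1, 128, 2048, 4096, 16384],  # T
        [5120, 7168],                 # D
        [None, 1200.0],               # scale_ub
        [True, False],                # contiguous
        [True, False],                # compiled
    ))
    print(len(grid))  # 5 * 2 * 2 * 2 * 2 = 80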
y_scale_ref = ref_fn() 2025-05-07T20:32:33.6258291Z 2025-05-07T20:32:33.6258394Z moe/activation_test.py:126: 2025-05-07T20:32:33.6258690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.6259029Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:33.6259352Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:33.6260157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:33.6260921Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:33.6261462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.6262151Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.6262837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:33.6263557Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:33.6264302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:33.6265055Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:33.6265788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:33.6266425Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:33.6267019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:33.6267539Z fn() 2025-05-07T20:32:33.6268056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:33.6268631Z self.fn.run( 2025-05-07T20:32:33.6269098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.6269631Z kernel = self.compile( 2025-05-07T20:32:33.6270289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.6270988Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.6271432Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.6271661Z 2025-05-07T20:32:33.6271879Z self = 2025-05-07T20:32:33.6272969Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.6274476Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9fca3124c0>} 2025-05-07T20:32:33.6275827Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.6276858Z context = 2025-05-07T20:32:33.6277145Z 2025-05-07T20:32:33.6277323Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.6277841Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.6278313Z module_map=module_map) 2025-05-07T20:32:33.6278728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.6279089Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:33.6279360Z E ^ 2025-05-07T20:32:33.6279831Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.6280283Z 2025-05-07T20:32:33.6280709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.6281229Z 2025-05-07T20:32:33.6281335Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.6281755Z self=, 2025-05-07T20:32:33.6282162Z T=2048, 2025-05-07T20:32:33.6282353Z D=5120, 2025-05-07T20:32:33.6282541Z scale_ub=None, 2025-05-07T20:32:33.6282760Z contiguous=True, 2025-05-07T20:32:33.6282986Z compiled=True, 2025-05-07T20:32:33.6283188Z ) 2025-05-07T20:32:34.2371772Z self = 2025-05-07T20:32:34.2372685Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:34.2373131Z 2025-05-07T20:32:34.2373254Z @given( 2025-05-07T20:32:34.2373627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.2374160Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.2374653Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.2375205Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.2375737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.2376206Z ) 2025-05-07T20:32:34.2376739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.2377368Z def test_silu_mul_quant( 2025-05-07T20:32:34.2377698Z self, 2025-05-07T20:32:34.2377946Z T: int, 2025-05-07T20:32:34.2378203Z D: int, 2025-05-07T20:32:34.2378493Z scale_ub: Optional[float], 2025-05-07T20:32:34.2378848Z contiguous: bool, 2025-05-07T20:32:34.2379172Z compiled: bool, 2025-05-07T20:32:34.2379474Z ) -> None: 2025-05-07T20:32:34.2379750Z torch.manual_seed(2025) 2025-05-07T20:32:34.2380075Z 2025-05-07T20:32:34.2380437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.2380904Z 2025-05-07T20:32:34.2381158Z x_sign = torch.sign(x) 2025-05-07T20:32:34.2381859Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.2382276Z x = x_sign * x_clamp 2025-05-07T20:32:34.2382597Z x0 = x[:, :D] 2025-05-07T20:32:34.2382880Z x1 = x[:, D:] 2025-05-07T20:32:34.2383148Z 2025-05-07T20:32:34.2383573Z if contiguous: 2025-05-07T20:32:34.2383929Z x0 = x0.contiguous() 2025-05-07T20:32:34.2384294Z x1 = x1.contiguous() 2025-05-07T20:32:34.2384648Z 2025-05-07T20:32:34.2384920Z if scale_ub is not None: 2025-05-07T20:32:34.2385320Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.2385980Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.2386458Z ) 2025-05-07T20:32:34.2386738Z else: 2025-05-07T20:32:34.2387044Z scale_ub_tensor = None 2025-05-07T20:32:34.2387401Z 2025-05-07T20:32:34.2387715Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.2388152Z op = silu_mul_quant 2025-05-07T20:32:34.2388500Z if compiled: 
2025-05-07T20:32:34.2388846Z op = torch.compile(op) 2025-05-07T20:32:34.2389249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2389631Z 2025-05-07T20:32:34.2390027Z y_fp8, y_scale = fn() 2025-05-07T20:32:34.2390439Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:34.2390880Z 2025-05-07T20:32:34.2391246Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.2391747Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:34.2392336Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:34.2392862Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:34.2393439Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:34.2393921Z 2025-05-07T20:32:34.2394226Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:34.2394518Z 2025-05-07T20:32:34.2394667Z moe/activation_test.py:126: 2025-05-07T20:32:34.2395115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2395658Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:34.2396180Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:34.2397489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:34.2398780Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:34.2399667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.2400858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.2402008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:34.2403270Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:34.2404978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:34.2415328Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:34.2416653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:34.2417789Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:34.2418824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:34.2419738Z fn() 2025-05-07T20:32:34.2420596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:34.2421617Z self.fn.run( 2025-05-07T20:32:34.2422378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.2423441Z kernel = self.compile( 2025-05-07T20:32:34.2424364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.2425502Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.2426178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2426567Z 2025-05-07T20:32:34.2426906Z self = 2025-05-07T20:32:34.2428739Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:34.2431492Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fca00cf70>} 2025-05-07T20:32:34.2433883Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.2435683Z context = 2025-05-07T20:32:34.2436193Z 2025-05-07T20:32:34.2436469Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.2437369Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.2438270Z module_map=module_map) 2025-05-07T20:32:34.2438883Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.2439478Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:34.2439919Z E ^ 2025-05-07T20:32:34.2440703Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.2441487Z 2025-05-07T20:32:34.2442206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.2443104Z 2025-05-07T20:32:34.2443287Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.2443989Z self=, 2025-05-07T20:32:34.2444661Z T=128, 2025-05-07T20:32:34.2444962Z D=5120, 2025-05-07T20:32:34.2445267Z scale_ub=None, 2025-05-07T20:32:34.2445600Z contiguous=True, 2025-05-07T20:32:34.2445958Z compiled=True, 2025-05-07T20:32:34.2446292Z ) 2025-05-07T20:32:35.2143277Z self = 2025-05-07T20:32:35.2144158Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.2144599Z 2025-05-07T20:32:35.2144722Z @given( 2025-05-07T20:32:35.2145099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2145609Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2146110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2146653Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2147176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2147649Z ) 2025-05-07T20:32:35.2148186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2148802Z def test_silu_mul_quant( 2025-05-07T20:32:35.2149128Z self, 2025-05-07T20:32:35.2149389Z T: int, 2025-05-07T20:32:35.2149650Z D: int, 2025-05-07T20:32:35.2150044Z scale_ub: Optional[float], 2025-05-07T20:32:35.2150408Z contiguous: bool, 2025-05-07T20:32:35.2150730Z compiled: bool, 2025-05-07T20:32:35.2151021Z ) -> None: 2025-05-07T20:32:35.2151306Z torch.manual_seed(2025) 2025-05-07T20:32:35.2151633Z 2025-05-07T20:32:35.2151990Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2152778Z 2025-05-07T20:32:35.2153038Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2153426Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2153846Z x = x_sign * x_clamp 2025-05-07T20:32:35.2154308Z x0 = x[:, :D] 2025-05-07T20:32:35.2154597Z x1 = x[:, D:] 2025-05-07T20:32:35.2154872Z 2025-05-07T20:32:35.2155109Z if contiguous: 2025-05-07T20:32:35.2155420Z x0 = x0.contiguous() 2025-05-07T20:32:35.2155768Z x1 = x1.contiguous() 2025-05-07T20:32:35.2156088Z 2025-05-07T20:32:35.2156454Z if scale_ub is not None: 2025-05-07T20:32:35.2156824Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2157275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2157698Z ) 2025-05-07T20:32:35.2157956Z else: 2025-05-07T20:32:35.2158238Z scale_ub_tensor = None 2025-05-07T20:32:35.2158574Z 2025-05-07T20:32:35.2158894Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:34.2443287Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same test listing, traceback, and CompilationError as above: fails in ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row]
2025-05-07T20:32:35.2203582Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same failure: ref_fn() -> _kernel_quantize_fp8_row]
2025-05-07T20:32:36.0581564Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:36.1022224Z W0507 20:32:36.100620 87906 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:36.1023792Z W0507 20:32:36.100620 87906 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:36.1025204Z W0507 20:32:36.100620 87906 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:36.1026206Z W0507 20:32:36.100620 87906 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:36.1027328Z W0507 20:32:36.100620 87906 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
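The recompile-limit warning above is separate from the FP8 failure: the Hypothesis examples alternate contiguous and non-contiguous slices, so 'x0' changes strides between calls and torch.compile re-specializes silu_mul_quant until it hits config.recompile_limit (8) and falls back to eager. A short sketch of the knobs the warning itself names, plus torch.compile's standard dynamic=True option as an assumed way to avoid the stride re-specialization:

    import torch
    import torch._dynamo

    # 1. Surface every recompilation reason (env var named in the warning):
    #      TORCH_LOGS="recompiles" python -m pytest moe/activation_test.py
    # 2. Raise the per-function budget (defers the eager fallback, nothing more):
    torch._dynamo.config.recompile_limit = 16
    # 3. Compile with dynamic shapes so stride/size changes do not re-specialize:
    silu_mul = torch.compile(
        lambda a, b: a * torch.sigmoid(a) * b,  # illustrative stand-in op
        dynamic=True,
    )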
2025-05-07T20:32:36.2242890Z self = 2025-05-07T20:32:36.2243645Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:36.2244200Z 2025-05-07T20:32:36.2244279Z @given( 2025-05-07T20:32:36.2244511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2244825Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2245211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2245541Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2245868Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2246142Z ) 2025-05-07T20:32:36.2246496Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2247009Z def test_silu_mul_quant( 2025-05-07T20:32:36.2247245Z self, 2025-05-07T20:32:36.2247428Z T: int, 2025-05-07T20:32:36.2247620Z D: int, 2025-05-07T20:32:36.2247835Z scale_ub: Optional[float], 2025-05-07T20:32:36.2248096Z contiguous: bool, 2025-05-07T20:32:36.2248329Z compiled: bool, 2025-05-07T20:32:36.2248557Z ) -> None: 2025-05-07T20:32:36.2248765Z torch.manual_seed(2025) 2025-05-07T20:32:36.2249004Z 2025-05-07T20:32:36.2249270Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2249604Z 2025-05-07T20:32:36.2249797Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2250083Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2250383Z x = x_sign * x_clamp 2025-05-07T20:32:36.2250620Z x0 = x[:, :D] 2025-05-07T20:32:36.2250832Z x1 = x[:, D:] 2025-05-07T20:32:36.2251032Z 2025-05-07T20:32:36.2251292Z if contiguous: 2025-05-07T20:32:36.2251523Z x0 = x0.contiguous() 2025-05-07T20:32:36.2251772Z x1 = x1.contiguous() 2025-05-07T20:32:36.2252005Z 2025-05-07T20:32:36.2252194Z if scale_ub is not None: 2025-05-07T20:32:36.2252456Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.2252791Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.2253101Z ) 2025-05-07T20:32:36.2253289Z else: 2025-05-07T20:32:36.2253488Z scale_ub_tensor = None 2025-05-07T20:32:36.2253738Z 2025-05-07T20:32:36.2253972Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2254308Z op = silu_mul_quant 2025-05-07T20:32:36.2254587Z if compiled: 2025-05-07T20:32:36.2254835Z op = torch.compile(op) 2025-05-07T20:32:36.2255123Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2255398Z 2025-05-07T20:32:36.2255592Z y_fp8, y_scale = fn() 2025-05-07T20:32:36.2255872Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:36.2256167Z 2025-05-07T20:32:36.2256410Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2256740Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:36.2257032Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:36.2257342Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:36.2257699Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:36.2258000Z 2025-05-07T20:32:36.2258200Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:36.2258391Z 2025-05-07T20:32:36.2258498Z moe/activation_test.py:126: 2025-05-07T20:32:36.2258787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2259119Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:36.2259440Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:36.2260220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:36.2260974Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:36.2261513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.2262187Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.2262919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:36.2263673Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:36.2264424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:36.2265213Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:36.2266000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:36.2266633Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:36.2267228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:36.2267740Z fn() 2025-05-07T20:32:36.2268235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:36.2268809Z self.fn.run( 2025-05-07T20:32:36.2269270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.2269784Z kernel = self.compile( 2025-05-07T20:32:36.2270405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.2271048Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.2271509Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2271737Z 2025-05-07T20:32:36.2271942Z self = 2025-05-07T20:32:36.2273023Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.2274417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc99b0c10>} 2025-05-07T20:32:36.2275759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.2276775Z context = 2025-05-07T20:32:36.2277068Z 2025-05-07T20:32:36.2277231Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.2277752Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.2278217Z module_map=module_map) 2025-05-07T20:32:36.2278577Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.2278938Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:36.2279204Z E ^ 2025-05-07T20:32:36.2279663Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.2280121Z 2025-05-07T20:32:36.2280530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.2281047Z 2025-05-07T20:32:36.2281146Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2281568Z self=, 2025-05-07T20:32:36.2281970Z T=1, 2025-05-07T20:32:36.2282153Z D=5120, 2025-05-07T20:32:36.2282348Z scale_ub=1200.0, 2025-05-07T20:32:36.2282561Z contiguous=True, 2025-05-07T20:32:36.2282785Z compiled=True, 2025-05-07T20:32:36.2282992Z ) 2025-05-07T20:32:36.3987982Z self = 2025-05-07T20:32:36.3989029Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:36.3989352Z 2025-05-07T20:32:36.3989436Z @given( 2025-05-07T20:32:36.3989678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.3990185Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.3990493Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.3990825Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.3991159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.3991520Z ) 2025-05-07T20:32:36.3991880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.3992321Z def test_silu_mul_quant( 2025-05-07T20:32:36.3992562Z self, 2025-05-07T20:32:36.3992813Z T: int, 2025-05-07T20:32:36.3993058Z D: int, 2025-05-07T20:32:36.3993277Z scale_ub: Optional[float], 2025-05-07T20:32:36.3993547Z contiguous: bool, 2025-05-07T20:32:36.3993786Z compiled: bool, 2025-05-07T20:32:36.3994011Z ) -> None: 2025-05-07T20:32:36.3994225Z torch.manual_seed(2025) 2025-05-07T20:32:36.3994505Z 2025-05-07T20:32:36.3994793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.3995130Z 2025-05-07T20:32:36.3995321Z x_sign = torch.sign(x) 2025-05-07T20:32:36.3995611Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.3995913Z x = x_sign * x_clamp 2025-05-07T20:32:36.3996152Z x0 = x[:, :D] 2025-05-07T20:32:36.3996454Z x1 = x[:, D:] 2025-05-07T20:32:36.3996660Z 2025-05-07T20:32:36.3996848Z if contiguous: 2025-05-07T20:32:36.3997081Z x0 = x0.contiguous() 2025-05-07T20:32:36.3997346Z x1 = x1.contiguous() 2025-05-07T20:32:36.4004983Z 2025-05-07T20:32:36.4005222Z if scale_ub is not None: 2025-05-07T20:32:36.4005517Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.4005875Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.4006183Z ) 2025-05-07T20:32:36.4006385Z else: 2025-05-07T20:32:36.4006608Z scale_ub_tensor = None 2025-05-07T20:32:36.4006863Z 2025-05-07T20:32:36.4007114Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.4007441Z op = silu_mul_quant 2025-05-07T20:32:36.4007693Z if compiled: 2025-05-07T20:32:36.4007953Z op = torch.compile(op) 2025-05-07T20:32:36.4008258Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4008546Z 2025-05-07T20:32:36.4008733Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.4008910Z 2025-05-07T20:32:36.4009010Z moe/activation_test.py:117: 2025-05-07T20:32:36.4009315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4009647Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.4009942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4010514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:36.4011073Z return fn(*args, **kwargs) 
2025-05-07T20:32:36.4011741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.4012435Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.4012979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.4013660Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.4014326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.4014859Z kernel = self.compile( 2025-05-07T20:32:36.4015410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.4016187Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.4016594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4016824Z 2025-05-07T20:32:36.4017109Z self = 2025-05-07T20:32:36.4018211Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.4019679Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc9256670>} 2025-05-07T20:32:36.4021024Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.4022056Z context = 2025-05-07T20:32:36.4022342Z 2025-05-07T20:32:36.4022517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.4023038Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.4023508Z module_map=module_map) 2025-05-07T20:32:36.4023884Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.4024340Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.4024625Z E ^ 2025-05-07T20:32:36.4025092Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.4025546Z 2025-05-07T20:32:36.4025972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.4026488Z 2025-05-07T20:32:36.4026595Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4027010Z self=, 2025-05-07T20:32:36.4027419Z T=1, 2025-05-07T20:32:36.4027612Z D=5120, 2025-05-07T20:32:36.4027810Z scale_ub=None, 2025-05-07T20:32:36.4028033Z contiguous=False, 2025-05-07T20:32:36.4028266Z compiled=True, 2025-05-07T20:32:36.4028473Z ) 2025-05-07T20:32:36.4832146Z self = 2025-05-07T20:32:36.4832867Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:36.4833178Z 2025-05-07T20:32:36.4833271Z @given( 2025-05-07T20:32:36.4833502Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.4833825Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.4834144Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.4834525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.4834869Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.4835160Z ) 2025-05-07T20:32:36.4835509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.4835960Z def test_silu_mul_quant( 2025-05-07T20:32:36.4836205Z self, 2025-05-07T20:32:36.4836406Z T: int, 2025-05-07T20:32:36.4836609Z D: int, 2025-05-07T20:32:36.4836833Z scale_ub: Optional[float], 2025-05-07T20:32:36.4837111Z contiguous: bool, 2025-05-07T20:32:36.4837349Z compiled: bool, 2025-05-07T20:32:36.4837584Z ) -> None: 2025-05-07T20:32:36.4837804Z torch.manual_seed(2025) 2025-05-07T20:32:36.4838043Z 2025-05-07T20:32:36.4838318Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.4838662Z 2025-05-07T20:32:36.4838852Z x_sign = torch.sign(x) 2025-05-07T20:32:36.4839146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.4839652Z x = x_sign * x_clamp 2025-05-07T20:32:36.4839888Z x0 = x[:, :D] 2025-05-07T20:32:36.4840109Z x1 = x[:, D:] 2025-05-07T20:32:36.4840319Z 2025-05-07T20:32:36.4840501Z if contiguous: 2025-05-07T20:32:36.4840813Z x0 = x0.contiguous() 2025-05-07T20:32:36.4841077Z x1 = x1.contiguous() 2025-05-07T20:32:36.4841314Z 2025-05-07T20:32:36.4841516Z if scale_ub is not None: 2025-05-07T20:32:36.4841792Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.4842137Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.4842518Z ) 2025-05-07T20:32:36.4842714Z else: 2025-05-07T20:32:36.4842926Z scale_ub_tensor = None 2025-05-07T20:32:36.4843175Z 2025-05-07T20:32:36.4843411Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.4843730Z op = silu_mul_quant 2025-05-07T20:32:36.4843978Z if compiled: 2025-05-07T20:32:36.4844237Z op = torch.compile(op) 2025-05-07T20:32:36.4844568Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4844869Z 2025-05-07T20:32:36.4845066Z y_fp8, y_scale = fn() 2025-05-07T20:32:36.4845362Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:36.4845649Z 2025-05-07T20:32:36.4845890Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.4846233Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:36.4846532Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:36.4846917Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:36.4847285Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:36.4847600Z 2025-05-07T20:32:36.4847800Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:36.4848002Z 2025-05-07T20:32:36.4848105Z moe/activation_test.py:126: 2025-05-07T20:32:36.4848407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4848742Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:36.4849073Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:36.4849875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:36.4850646Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:36.4851193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.4851893Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.4852585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:36.4853312Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:36.4854061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:36.4854828Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:36.4855566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:36.4856214Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:36.4856824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:36.4857346Z fn() 2025-05-07T20:32:36.4857862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:36.4858444Z self.fn.run( 2025-05-07T20:32:36.4858915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.4859449Z kernel = self.compile( 2025-05-07T20:32:36.4860066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.4860713Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.4861153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4861382Z 2025-05-07T20:32:36.4861595Z self = 2025-05-07T20:32:36.4862688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.4864115Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9fc92c0dc0>} 2025-05-07T20:32:36.4865518Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.4866552Z context = 2025-05-07T20:32:36.4866847Z 2025-05-07T20:32:36.4867023Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.4867544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.4868016Z module_map=module_map) 2025-05-07T20:32:36.4868426Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.4868789Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:36.4869050Z E ^ 2025-05-07T20:32:36.4869516Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.4870078Z 2025-05-07T20:32:36.4870501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.4871018Z 2025-05-07T20:32:36.4871130Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4871545Z self=, 2025-05-07T20:32:36.4871950Z T=1, 2025-05-07T20:32:36.4872135Z D=5120, 2025-05-07T20:32:36.4872329Z scale_ub=None, 2025-05-07T20:32:36.4872547Z contiguous=True, 2025-05-07T20:32:36.4872773Z compiled=False, 2025-05-07T20:32:36.4872981Z ) 2025-05-07T20:32:36.8406740Z self = 2025-05-07T20:32:36.8407385Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.8407755Z 2025-05-07T20:32:36.8407875Z @given( 2025-05-07T20:32:36.8408112Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8408513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8408828Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8409151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8409481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8409768Z ) 2025-05-07T20:32:36.8410128Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8410561Z def test_silu_mul_quant( 2025-05-07T20:32:36.8410801Z self, 2025-05-07T20:32:36.8410994Z T: int, 2025-05-07T20:32:36.8411184Z D: int, 2025-05-07T20:32:36.8411403Z scale_ub: Optional[float], 2025-05-07T20:32:36.8411675Z contiguous: bool, 2025-05-07T20:32:36.8411904Z compiled: bool, 2025-05-07T20:32:36.8412132Z ) -> None: 2025-05-07T20:32:36.8412345Z torch.manual_seed(2025) 2025-05-07T20:32:36.8412578Z 2025-05-07T20:32:36.8412848Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8413188Z 2025-05-07T20:32:36.8413650Z x_sign = torch.sign(x) 2025-05-07T20:32:36.8413936Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.8414258Z x = x_sign * x_clamp 2025-05-07T20:32:36.8414535Z x0 = x[:, :D] 2025-05-07T20:32:36.8414743Z x1 = x[:, D:] 2025-05-07T20:32:36.8415027Z 2025-05-07T20:32:36.8415212Z if contiguous: 2025-05-07T20:32:36.8415453Z x0 = x0.contiguous() 2025-05-07T20:32:36.8415710Z x1 = x1.contiguous() 2025-05-07T20:32:36.8415948Z 2025-05-07T20:32:36.8416140Z if scale_ub is not None: 2025-05-07T20:32:36.8416413Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.8416829Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.8417136Z ) 2025-05-07T20:32:36.8417323Z else: 2025-05-07T20:32:36.8417530Z scale_ub_tensor = None 2025-05-07T20:32:36.8417776Z 2025-05-07T20:32:36.8417997Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.8418314Z op = silu_mul_quant 2025-05-07T20:32:36.8418561Z if compiled: 2025-05-07T20:32:36.8418803Z op 
= torch.compile(op) 2025-05-07T20:32:36.8419099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.8419376Z 2025-05-07T20:32:36.8419575Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.8419739Z 2025-05-07T20:32:36.8419842Z moe/activation_test.py:117: 2025-05-07T20:32:36.8420142Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.8420474Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.8420828Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.8421526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.8422215Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.8422752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.8423429Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.8424085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.8424644Z kernel = self.compile( 2025-05-07T20:32:36.8425197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.8425845Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.8426238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.8426465Z 2025-05-07T20:32:36.8426697Z self = 2025-05-07T20:32:36.8427781Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.8429166Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc926edc0>} 2025-05-07T20:32:36.8430612Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.8431633Z context = 2025-05-07T20:32:36.8431919Z 2025-05-07T20:32:36.8432090Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.8432608Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.8433066Z module_map=module_map) 2025-05-07T20:32:36.8433431Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.8433841Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.8434090Z E ^ 2025-05-07T20:32:36.8434555Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.8435004Z 2025-05-07T20:32:36.8435464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.8435976Z 2025-05-07T20:32:36.8436089Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8436496Z self=, 2025-05-07T20:32:36.8436941Z T=128, 2025-05-07T20:32:36.8437129Z D=5120, 2025-05-07T20:32:36.8437312Z scale_ub=None, 2025-05-07T20:32:36.8437530Z contiguous=False, 2025-05-07T20:32:36.8437757Z compiled=True, 2025-05-07T20:32:36.8437960Z ) 2025-05-07T20:32:36.8438284Z self = 2025-05-07T20:32:36.8438780Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:36.8439048Z 2025-05-07T20:32:36.8439131Z @given( 2025-05-07T20:32:36.8439357Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8439672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8439985Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8440311Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8440646Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8440931Z ) 2025-05-07T20:32:36.8441324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8441767Z def test_silu_mul_quant( 2025-05-07T20:32:36.8442014Z self, 2025-05-07T20:32:36.8442204Z T: int, 2025-05-07T20:32:36.8442400Z D: int, 2025-05-07T20:32:36.8442617Z scale_ub: Optional[float], 2025-05-07T20:32:36.8442881Z contiguous: bool, 2025-05-07T20:32:36.8443125Z compiled: bool, 2025-05-07T20:32:36.8443346Z ) -> None: 2025-05-07T20:32:36.8443562Z torch.manual_seed(2025) 2025-05-07T20:32:36.8443798Z 2025-05-07T20:32:36.8444067Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8444445Z 2025-05-07T20:32:36.8444649Z x_sign = torch.sign(x) 2025-05-07T20:32:36.8444940Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.8445249Z x = x_sign * x_clamp 2025-05-07T20:32:36.8445478Z x0 = x[:, :D] 2025-05-07T20:32:36.8445695Z x1 = x[:, D:] 2025-05-07T20:32:36.8445905Z 2025-05-07T20:32:36.8446081Z if contiguous: 2025-05-07T20:32:36.8446310Z x0 = x0.contiguous() 2025-05-07T20:32:36.8446570Z x1 = x1.contiguous() 2025-05-07T20:32:36.8446802Z 2025-05-07T20:32:36.8446991Z if scale_ub is not None: 2025-05-07T20:32:36.8447262Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.8447593Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.8447897Z ) 2025-05-07T20:32:36.8448092Z else: 2025-05-07T20:32:36.8448299Z scale_ub_tensor = None 2025-05-07T20:32:36.8448542Z 2025-05-07T20:32:36.8448774Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.8449091Z op = silu_mul_quant 2025-05-07T20:32:36.8449338Z if compiled: 2025-05-07T20:32:36.8449582Z op = torch.compile(op) 2025-05-07T20:32:36.8449885Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.8450156Z 2025-05-07T20:32:36.8450352Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.8450515Z 2025-05-07T20:32:36.8450620Z moe/activation_test.py:117: 2025-05-07T20:32:36.8450910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.8451243Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.8451523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.8452128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:36.8452679Z return fn(*args, **kwargs) 
2025-05-07T20:32:36.8453387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.8454074Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.8454629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.8455334Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.8456097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.8456624Z kernel = self.compile( 2025-05-07T20:32:36.8457158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.8457809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.8458202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.8458427Z 2025-05-07T20:32:36.8458638Z self = 2025-05-07T20:32:36.8459723Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.8461174Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8ee8040>} 2025-05-07T20:32:36.8462518Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.8463548Z context = 2025-05-07T20:32:36.8463834Z 2025-05-07T20:32:36.8463999Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.8464521Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.8464982Z module_map=module_map) 2025-05-07T20:32:36.8465344Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.8465691Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.8465958Z E ^ 2025-05-07T20:32:36.8466423Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.8466871Z 2025-05-07T20:32:36.8467285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.8467803Z 2025-05-07T20:32:36.8467905Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8468318Z self=, 2025-05-07T20:32:36.8468718Z T=128, 2025-05-07T20:32:36.8468898Z D=7168, 2025-05-07T20:32:36.8469090Z scale_ub=1200.0, 2025-05-07T20:32:36.8469311Z contiguous=False, 2025-05-07T20:32:36.8469532Z compiled=False, 2025-05-07T20:32:36.8469736Z ) 2025-05-07T20:32:37.0011728Z self = 2025-05-07T20:32:37.0012247Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.0012556Z 2025-05-07T20:32:37.0012640Z @given( 2025-05-07T20:32:37.0012885Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0013316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0013677Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0014005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0014573Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0014903Z ) 2025-05-07T20:32:37.0015250Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0015692Z def test_silu_mul_quant( 2025-05-07T20:32:37.0016025Z self, 2025-05-07T20:32:37.0016218Z T: int, 2025-05-07T20:32:37.0016418Z D: int, 2025-05-07T20:32:37.0016638Z scale_ub: Optional[float], 2025-05-07T20:32:37.0016912Z contiguous: bool, 2025-05-07T20:32:37.0017149Z compiled: bool, 2025-05-07T20:32:37.0017377Z ) -> None: 2025-05-07T20:32:37.0017679Z torch.manual_seed(2025) 2025-05-07T20:32:37.0017917Z 2025-05-07T20:32:37.0018186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0018531Z 2025-05-07T20:32:37.0018717Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0019005Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0019320Z x = x_sign * x_clamp 2025-05-07T20:32:37.0019552Z x0 = x[:, :D] 2025-05-07T20:32:37.0019767Z x1 = x[:, D:] 2025-05-07T20:32:37.0019977Z 2025-05-07T20:32:37.0020157Z if contiguous: 2025-05-07T20:32:37.0020390Z x0 = x0.contiguous() 2025-05-07T20:32:37.0020654Z x1 = x1.contiguous() 2025-05-07T20:32:37.0020889Z 2025-05-07T20:32:37.0021086Z if scale_ub is not None: 2025-05-07T20:32:37.0021361Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0021691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0022010Z ) 2025-05-07T20:32:37.0022281Z else: 2025-05-07T20:32:37.0022497Z scale_ub_tensor = None 2025-05-07T20:32:37.0022741Z 2025-05-07T20:32:37.0022974Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0023293Z op = silu_mul_quant 2025-05-07T20:32:37.0023537Z if compiled: 2025-05-07T20:32:37.0023787Z op = torch.compile(op) 2025-05-07T20:32:37.0024087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0024358Z 2025-05-07T20:32:37.0024556Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.0024721Z 2025-05-07T20:32:37.0024834Z moe/activation_test.py:117: 2025-05-07T20:32:37.0025140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0032339Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.0032645Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0033361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.0034071Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.0034674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.0035371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.0036038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.0036577Z kernel = self.compile( 2025-05-07T20:32:37.0037135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.0037803Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.0038205Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0038447Z 2025-05-07T20:32:37.0038661Z self = 2025-05-07T20:32:37.0039763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.0041175Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8ee8ca0>} 2025-05-07T20:32:37.0042653Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.0043699Z context = 2025-05-07T20:32:37.0043997Z 2025-05-07T20:32:37.0044166Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.0044755Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.0045271Z module_map=module_map) 2025-05-07T20:32:37.0045645Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.0046012Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.0046280Z E ^ 2025-05-07T20:32:37.0046747Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0047214Z 2025-05-07T20:32:37.0047639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0048164Z 2025-05-07T20:32:37.0048281Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0048706Z self=, 2025-05-07T20:32:37.0049109Z T=128, 2025-05-07T20:32:37.0049298Z D=5120, 2025-05-07T20:32:37.0049500Z scale_ub=None, 2025-05-07T20:32:37.0049761Z contiguous=False, 2025-05-07T20:32:37.0049996Z compiled=False, 2025-05-07T20:32:37.0050210Z ) 2025-05-07T20:32:37.0050529Z self = 2025-05-07T20:32:37.0051050Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.0051330Z 2025-05-07T20:32:37.0051416Z @given( 2025-05-07T20:32:37.0051654Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0051970Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0052285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0052626Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0052956Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0053249Z ) 2025-05-07T20:32:37.0053604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0054051Z def test_silu_mul_quant( 2025-05-07T20:32:37.0054295Z self, 2025-05-07T20:32:37.0054501Z T: int, 2025-05-07T20:32:37.0054701Z D: int, 2025-05-07T20:32:37.0054920Z scale_ub: Optional[float], 2025-05-07T20:32:37.0055207Z contiguous: bool, 2025-05-07T20:32:37.0055452Z compiled: bool, 2025-05-07T20:32:37.0055675Z ) -> None: 2025-05-07T20:32:37.0055895Z torch.manual_seed(2025) 2025-05-07T20:32:37.0056150Z 2025-05-07T20:32:37.0056423Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0056779Z 2025-05-07T20:32:37.0056981Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0057276Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0057593Z x = x_sign * x_clamp 2025-05-07T20:32:37.0057832Z x0 = x[:, :D] 2025-05-07T20:32:37.0058048Z x1 = x[:, D:] 2025-05-07T20:32:37.0058256Z 2025-05-07T20:32:37.0058436Z if contiguous: 2025-05-07T20:32:37.0058672Z x0 = x0.contiguous() 2025-05-07T20:32:37.0058933Z x1 = x1.contiguous() 2025-05-07T20:32:37.0059176Z 2025-05-07T20:32:37.0059370Z if scale_ub is not None: 2025-05-07T20:32:37.0059641Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0059979Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0060295Z ) 2025-05-07T20:32:37.0060483Z else: 2025-05-07T20:32:37.0060746Z scale_ub_tensor = None 2025-05-07T20:32:37.0061003Z 2025-05-07T20:32:37.0061237Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0061556Z op = silu_mul_quant 2025-05-07T20:32:37.0061812Z if compiled: 2025-05-07T20:32:37.0062129Z op = torch.compile(op) 2025-05-07T20:32:37.0062428Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0062709Z 2025-05-07T20:32:37.0062905Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.0063070Z 2025-05-07T20:32:37.0063173Z moe/activation_test.py:117: 2025-05-07T20:32:37.0063519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0063860Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.0064141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0064902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.0065612Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
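The failing type is an architecture limitation rather than bad test data: Triton's
fp8e4nv element type (the one backing torch.float8_e4m3fn) is implemented only for
NVIDIA GPUs of compute capability 8.9 and newer (Ada/Hopper), and on older parts this
Triton build offers only 'fp8e4b15' and 'fp8e5', exactly as the ValueError reports.
A minimal sketch of a capability guard that would skip the test on such devices (the
helper and class names here are illustrative, not FBGEMM's actual code):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv needs compute capability >= 8.9; e.g. an A10G reports (8, 6)
        # and would be skipped, while an H100 reports (9, 0) and would run.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantTests(unittest.TestCase):
        ...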
[The same _fbgemm_silu_mul_quant traceback and CompilationError repeat for the next
examples; the duplicated test source and Triton frames are elided.]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)   -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

For this example fn() completed and the failure surfaced in the reference path instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
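Note that the reference path hits the same wall: triton_quantize_fp8_row launches a
Triton kernel of its own (_kernel_quantize_fp8_row), so on an unsupported GPU the test
cannot even build its oracle. As a rough picture of what that reference computes, a
plain-PyTorch sketch of row-wise fp8 quantization follows; the 448.0 bound is the
finite maximum of torch.float8_e4m3fn, and the clamping details are an assumption,
not a copy of FBGEMM's kernel:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = 448.0  # finite max of torch.float8_e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, sized so the row's largest |value| maps to FP8_MAX.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        # The test dequantizes with y_fp8.to(torch.float32) * scale[:, None].
        return y_fp8, scale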
The remaining examples all failed in the forward path again, with the identical
CompilationError from _fbgemm_silu_mul_quant (duplicated source and tracebacks elided):

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
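For triage, the error reproduces without FBGEMM or Hypothesis at all. A minimal
standalone sketch (assuming a CUDA build of Triton) that should raise the same
ValueError at kernel-compile time on a pre-SM 8.9 device:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # Materializing an fp8e4nv value is the operation the backend rejects.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)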
2025-05-07T20:32:38.5844879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:38.5845655Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:38.5846381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:38.5847066Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:38.5847721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:38.5848259Z     kernel = self.compile(
2025-05-07T20:32:38.5848803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:38.5849454Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:38.5849857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:38.5850091Z 
2025-05-07T20:32:38.5850299Z self = 
2025-05-07T20:32:38.5851393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:38.5852804Z codegen_fns = {'convert_custom_types': , 'min_dot_size': <function min_dot_size.<locals>.<lambda> at 0x7f9fc86f5ee0>}
2025-05-07T20:32:38.5854151Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:38.5855182Z context = 
2025-05-07T20:32:38.5855476Z 
2025-05-07T20:32:38.5855642Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:38.5856167Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:38.5856637Z                            module_map=module_map)
2025-05-07T20:32:38.5857009Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:38.5857369Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:38.5857624Z E       ^
2025-05-07T20:32:38.5858098Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.5858614Z 
2025-05-07T20:32:38.5859031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:38.5859542Z 
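Every example Hypothesis draws dies in the same kernel compile, so one plausible test-side remedy is to skip rather than fail on pre-SM 8.9 devices. A sketch along those lines, assuming a unittest-style test class; the class name and decorator placement are illustrative, not the repository's actual code:

import unittest

import torch

_HAS_NATIVE_FP8 = (
    torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
)

class ActivationTests(unittest.TestCase):  # class name assumed from the trace
    @unittest.skipIf(
        not _HAS_NATIVE_FP8,
        "Triton fp8e4nv requires compute capability >= 8.9 (Ada/Hopper)",
    )
    def test_silu_mul_quant(self) -> None:
        ...  # Hypothesis-driven body as shown in the log above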
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.5858614Z 2025-05-07T20:32:38.5859031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.5859542Z 2025-05-07T20:32:38.5859654Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.5860118Z self=, 2025-05-07T20:32:38.5860541Z T=4096, 2025-05-07T20:32:38.5860738Z D=5120, 2025-05-07T20:32:38.5860932Z scale_ub=None, 2025-05-07T20:32:38.5861156Z contiguous=False, 2025-05-07T20:32:38.5861390Z compiled=True, 2025-05-07T20:32:38.5861643Z ) 2025-05-07T20:32:38.5861968Z self = 2025-05-07T20:32:38.5862472Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:38.5862741Z 2025-05-07T20:32:38.5862823Z @given( 2025-05-07T20:32:38.5863052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.5863373Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.5863690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.5864014Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.5864357Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.5864653Z ) 2025-05-07T20:32:38.5865001Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.5865444Z def test_silu_mul_quant( 2025-05-07T20:32:38.5865689Z self, 2025-05-07T20:32:38.5865885Z T: int, 2025-05-07T20:32:38.5866640Z D: int, 2025-05-07T20:32:38.5866865Z scale_ub: Optional[float], 2025-05-07T20:32:38.5867139Z contiguous: bool, 2025-05-07T20:32:38.5867381Z compiled: bool, 2025-05-07T20:32:38.5867606Z ) -> None: 2025-05-07T20:32:38.5867827Z torch.manual_seed(2025) 2025-05-07T20:32:38.5868064Z 2025-05-07T20:32:38.5868348Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.5868706Z 2025-05-07T20:32:38.5868897Z x_sign = torch.sign(x) 2025-05-07T20:32:38.5869189Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.5869506Z x = x_sign * x_clamp 2025-05-07T20:32:38.5869745Z x0 = x[:, :D] 2025-05-07T20:32:38.5870039Z x1 = x[:, D:] 2025-05-07T20:32:38.5870248Z 2025-05-07T20:32:38.5870428Z if contiguous: 2025-05-07T20:32:38.5870662Z x0 = x0.contiguous() 2025-05-07T20:32:38.5870922Z x1 = x1.contiguous() 2025-05-07T20:32:38.5871158Z 2025-05-07T20:32:38.5871358Z if scale_ub is not None: 2025-05-07T20:32:38.5871632Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.5871965Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.5872274Z ) 2025-05-07T20:32:38.5872470Z else: 2025-05-07T20:32:38.5872682Z scale_ub_tensor = None 2025-05-07T20:32:38.5872934Z 2025-05-07T20:32:38.5873172Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.5873488Z op = silu_mul_quant 2025-05-07T20:32:38.5873741Z if compiled: 2025-05-07T20:32:38.5873990Z op = torch.compile(op) 2025-05-07T20:32:38.5874291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.5874563Z 2025-05-07T20:32:38.5874756Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.5874920Z 2025-05-07T20:32:38.5875028Z moe/activation_test.py:117: 2025-05-07T20:32:38.5875318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.5875655Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.5875936Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.5876494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.5877046Z return fn(*args, **kwargs) 
2025-05-07T20:32:38.5877714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.5878455Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.5879021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.5879706Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.5880367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.5880899Z kernel = self.compile( 2025-05-07T20:32:38.5881478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.5882131Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.5882532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.5882758Z 2025-05-07T20:32:38.5882974Z self = 2025-05-07T20:32:38.5884072Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.5885458Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8ebe940>} 2025-05-07T20:32:38.5886853Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.5887883Z context = 2025-05-07T20:32:38.5888169Z 2025-05-07T20:32:38.5888335Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.5888870Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.5889341Z module_map=module_map) 2025-05-07T20:32:38.5889729Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.5890080Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.5890345Z E ^ 2025-05-07T20:32:38.5890808Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.5898626Z 2025-05-07T20:32:38.5899103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.5899631Z 2025-05-07T20:32:38.7848716Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.7849347Z self=, 2025-05-07T20:32:38.7850108Z T=4096, 2025-05-07T20:32:38.7850371Z D=5120, 2025-05-07T20:32:38.7850653Z scale_ub=1200.0, 2025-05-07T20:32:38.7850913Z contiguous=False, 2025-05-07T20:32:38.7851136Z compiled=False, 2025-05-07T20:32:38.7851336Z ) 2025-05-07T20:32:38.7851661Z self = 2025-05-07T20:32:38.7852165Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:38.7852443Z 2025-05-07T20:32:38.7852520Z @given( 2025-05-07T20:32:38.7852750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.7853064Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.7853375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.7853710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.7854039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.7854335Z ) 2025-05-07T20:32:38.7854687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.7855421Z def test_silu_mul_quant( 2025-05-07T20:32:38.7855663Z self, 2025-05-07T20:32:38.7855850Z T: int, 2025-05-07T20:32:38.7856055Z D: int, 2025-05-07T20:32:38.7856277Z scale_ub: Optional[float], 2025-05-07T20:32:38.7856551Z contiguous: bool, 2025-05-07T20:32:38.7856876Z compiled: bool, 2025-05-07T20:32:38.7857117Z ) -> None: 2025-05-07T20:32:38.7857328Z torch.manual_seed(2025) 2025-05-07T20:32:38.7857573Z 2025-05-07T20:32:38.7857851Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.7858189Z 2025-05-07T20:32:38.7858465Z x_sign = torch.sign(x) 2025-05-07T20:32:38.7858760Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.7859064Z x = x_sign * x_clamp 2025-05-07T20:32:38.7859307Z x0 = x[:, :D] 2025-05-07T20:32:38.7859527Z x1 = x[:, D:] 2025-05-07T20:32:38.7859743Z 2025-05-07T20:32:38.7859924Z if contiguous: 2025-05-07T20:32:38.7860160Z x0 = x0.contiguous() 2025-05-07T20:32:38.7860419Z x1 = x1.contiguous() 2025-05-07T20:32:38.7860650Z 2025-05-07T20:32:38.7860843Z if scale_ub is not None: 2025-05-07T20:32:38.7861124Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.7861460Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.7861767Z ) 2025-05-07T20:32:38.7861964Z else: 2025-05-07T20:32:38.7862173Z scale_ub_tensor = None 2025-05-07T20:32:38.7862433Z 2025-05-07T20:32:38.7862697Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.7863096Z op = silu_mul_quant 2025-05-07T20:32:38.7863350Z if compiled: 2025-05-07T20:32:38.7863605Z op = torch.compile(op) 2025-05-07T20:32:38.7863913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.7864187Z 2025-05-07T20:32:38.7864383Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.7864561Z 2025-05-07T20:32:38.7864668Z moe/activation_test.py:117: 2025-05-07T20:32:38.7864974Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.7865357Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.7865642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.7866337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.7867027Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.7867563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.7868253Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.7868926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.7869455Z kernel = self.compile( 2025-05-07T20:32:38.7870094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.7870756Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.7871163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.7871392Z 2025-05-07T20:32:38.7871599Z self = 2025-05-07T20:32:38.7872701Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.7874102Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc892d3a0>} 2025-05-07T20:32:38.7875456Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.7876536Z context = 2025-05-07T20:32:38.7876825Z 2025-05-07T20:32:38.7877030Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.7877555Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.7878018Z module_map=module_map) 2025-05-07T20:32:38.7878379Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.7878776Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.7879035Z E ^ 2025-05-07T20:32:38.7879499Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.7879963Z 2025-05-07T20:32:38.7880382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.7880904Z 2025-05-07T20:32:38.7881006Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.7881425Z self=, 2025-05-07T20:32:38.7881825Z T=4096, 2025-05-07T20:32:38.7882015Z D=5120, 2025-05-07T20:32:38.7882207Z scale_ub=1200.0, 2025-05-07T20:32:38.7882423Z contiguous=False, 2025-05-07T20:32:38.7882650Z compiled=True, 2025-05-07T20:32:38.7882856Z ) 2025-05-07T20:32:38.7883172Z self = 2025-05-07T20:32:38.7883719Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:38.7884001Z 2025-05-07T20:32:38.7884073Z @given( 2025-05-07T20:32:38.7884305Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.7884606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.7884967Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.7885310Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.7885632Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.7885917Z ) 2025-05-07T20:32:38.7886267Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.7886705Z def test_silu_mul_quant( 2025-05-07T20:32:38.7886941Z self, 2025-05-07T20:32:38.7887131Z T: int, 2025-05-07T20:32:38.7887326Z D: int, 2025-05-07T20:32:38.7887533Z scale_ub: Optional[float], 2025-05-07T20:32:38.7887809Z contiguous: bool, 2025-05-07T20:32:38.7888056Z compiled: bool, 2025-05-07T20:32:38.7888273Z ) -> None: 2025-05-07T20:32:38.7888496Z torch.manual_seed(2025) 2025-05-07T20:32:38.7888748Z 2025-05-07T20:32:38.7889017Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.7889369Z 2025-05-07T20:32:38.7889560Z x_sign = torch.sign(x) 2025-05-07T20:32:38.7889848Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.7890167Z x = x_sign * x_clamp 2025-05-07T20:32:38.7890419Z x0 = x[:, :D] 2025-05-07T20:32:38.7890635Z x1 = x[:, D:] 2025-05-07T20:32:38.7890848Z 2025-05-07T20:32:38.7891041Z if contiguous: 2025-05-07T20:32:38.7891273Z x0 = x0.contiguous() 2025-05-07T20:32:38.7891535Z x1 = x1.contiguous() 2025-05-07T20:32:38.7891779Z 2025-05-07T20:32:38.7891965Z if scale_ub is not None: 2025-05-07T20:32:38.7892236Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.7892577Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.7892888Z ) 2025-05-07T20:32:38.7893072Z else: 2025-05-07T20:32:38.7893285Z scale_ub_tensor = None 2025-05-07T20:32:38.7893535Z 2025-05-07T20:32:38.7893757Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.7894071Z op = silu_mul_quant 2025-05-07T20:32:38.7894375Z if compiled: 2025-05-07T20:32:38.7894619Z op = torch.compile(op) 2025-05-07T20:32:38.7894920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.7895215Z 2025-05-07T20:32:38.7895462Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.7895633Z 2025-05-07T20:32:38.7895734Z moe/activation_test.py:117: 2025-05-07T20:32:38.7896025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.7896353Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.7896626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.7897220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.7897775Z return fn(*args, **kwargs) 
2025-05-07T20:32:38.7898424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.7899112Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.7899640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.7900313Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.7900966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.7901492Z kernel = self.compile( 2025-05-07T20:32:38.7902024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.7902709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.7903100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.7903331Z 2025-05-07T20:32:38.7903535Z self = 2025-05-07T20:32:38.7904891Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.7906275Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc892d280>} 2025-05-07T20:32:38.7907624Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.7908662Z context = 2025-05-07T20:32:38.7908952Z 2025-05-07T20:32:38.7909118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.7909647Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.7910159Z module_map=module_map) 2025-05-07T20:32:38.7910523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.7910877Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.7911129Z E ^ 2025-05-07T20:32:38.7911595Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.7912051Z 2025-05-07T20:32:38.7912469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.7912984Z 2025-05-07T20:32:39.0676130Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.0676688Z self=, 2025-05-07T20:32:39.0677165Z T=2048, 2025-05-07T20:32:39.0677353Z D=7168, 2025-05-07T20:32:39.0677545Z scale_ub=1200.0, 2025-05-07T20:32:39.0677767Z contiguous=False, 2025-05-07T20:32:39.0677993Z compiled=False, 2025-05-07T20:32:39.0678441Z ) 2025-05-07T20:32:39.0678751Z self = 2025-05-07T20:32:39.0679247Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.0679545Z 2025-05-07T20:32:39.0679713Z @given( 2025-05-07T20:32:39.0679946Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.0680262Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.0680565Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.0680900Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.0681316Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.0681597Z ) 2025-05-07T20:32:39.0681950Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.0682391Z def test_silu_mul_quant( 2025-05-07T20:32:39.0682636Z self, 2025-05-07T20:32:39.0682827Z T: int, 2025-05-07T20:32:39.0683031Z D: int, 2025-05-07T20:32:39.0683250Z scale_ub: Optional[float], 2025-05-07T20:32:39.0683517Z contiguous: bool, 2025-05-07T20:32:39.0683759Z compiled: bool, 2025-05-07T20:32:39.0683989Z ) -> None: 2025-05-07T20:32:39.0684205Z torch.manual_seed(2025) 2025-05-07T20:32:39.0684446Z 2025-05-07T20:32:39.0684719Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.0685084Z 2025-05-07T20:32:39.0685310Z x_sign = torch.sign(x) 2025-05-07T20:32:39.0685601Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.0685989Z x = x_sign * x_clamp 2025-05-07T20:32:39.0686234Z x0 = x[:, :D] 2025-05-07T20:32:39.0686452Z x1 = x[:, D:] 2025-05-07T20:32:39.0686653Z 2025-05-07T20:32:39.0686840Z if contiguous: 2025-05-07T20:32:39.0687074Z x0 = x0.contiguous() 2025-05-07T20:32:39.0687328Z x1 = x1.contiguous() 2025-05-07T20:32:39.0687572Z 2025-05-07T20:32:39.0687763Z if scale_ub is not None: 2025-05-07T20:32:39.0688032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.0688376Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.0688683Z ) 2025-05-07T20:32:39.0688879Z else: 2025-05-07T20:32:39.0689085Z scale_ub_tensor = None 2025-05-07T20:32:39.0689336Z 2025-05-07T20:32:39.0689572Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.0689886Z op = silu_mul_quant 2025-05-07T20:32:39.0690139Z if compiled: 2025-05-07T20:32:39.0690395Z op = torch.compile(op) 2025-05-07T20:32:39.0690690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.0690968Z 2025-05-07T20:32:39.0691166Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.0691330Z 2025-05-07T20:32:39.0691432Z moe/activation_test.py:117: 2025-05-07T20:32:39.0691736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.0692070Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.0692358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.0693055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.0693765Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.0694306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.0694981Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.0695650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.0696184Z kernel = self.compile( 2025-05-07T20:32:39.0696725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.0697369Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.0697858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.0698084Z 2025-05-07T20:32:39.0698298Z self = 2025-05-07T20:32:39.0699434Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.0701079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8a0e670>} 2025-05-07T20:32:39.0702489Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.0703518Z context = 2025-05-07T20:32:39.0704086Z 2025-05-07T20:32:39.0704255Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.0704780Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.0705292Z module_map=module_map) 2025-05-07T20:32:39.0705667Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.0706027Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.0706278Z E ^ 2025-05-07T20:32:39.0706819Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.0707380Z 2025-05-07T20:32:39.0707883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.0708509Z 2025-05-07T20:32:39.0708627Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.0709128Z self=, 2025-05-07T20:32:39.0709588Z T=1, 2025-05-07T20:32:39.0709780Z D=7168, 2025-05-07T20:32:39.0710087Z scale_ub=None, 2025-05-07T20:32:39.0710296Z contiguous=True, 2025-05-07T20:32:39.0710522Z compiled=False, 2025-05-07T20:32:39.0710727Z ) 2025-05-07T20:32:39.0711039Z self = 2025-05-07T20:32:39.0711524Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:39.0711787Z 2025-05-07T20:32:39.0711869Z @given( 2025-05-07T20:32:39.0712107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.0712414Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.0712721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.0713053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.0713377Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.0713666Z ) 2025-05-07T20:32:39.0714015Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.0714447Z def test_silu_mul_quant( 2025-05-07T20:32:39.0714691Z self, 2025-05-07T20:32:39.0714887Z T: int, 2025-05-07T20:32:39.0715083Z D: int, 2025-05-07T20:32:39.0715345Z scale_ub: Optional[float], 2025-05-07T20:32:39.0715627Z contiguous: bool, 2025-05-07T20:32:39.0715872Z compiled: bool, 2025-05-07T20:32:39.0716093Z ) -> None: 2025-05-07T20:32:39.0716310Z torch.manual_seed(2025) 2025-05-07T20:32:39.0716561Z 2025-05-07T20:32:39.0716832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.0717175Z 2025-05-07T20:32:39.0717371Z x_sign = torch.sign(x) 2025-05-07T20:32:39.0717661Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.0717979Z x = x_sign * x_clamp 2025-05-07T20:32:39.0718298Z x0 = x[:, :D] 2025-05-07T20:32:39.0718511Z x1 = x[:, D:] 2025-05-07T20:32:39.0718722Z 2025-05-07T20:32:39.0718912Z if contiguous: 2025-05-07T20:32:39.0719138Z x0 = x0.contiguous() 2025-05-07T20:32:39.0719398Z x1 = x1.contiguous() 2025-05-07T20:32:39.0719711Z 2025-05-07T20:32:39.0719901Z if scale_ub is not None: 2025-05-07T20:32:39.0720183Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.0720518Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.0720830Z ) 2025-05-07T20:32:39.0721082Z else: 2025-05-07T20:32:39.0721299Z scale_ub_tensor = None 2025-05-07T20:32:39.0721556Z 2025-05-07T20:32:39.0721784Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.0722100Z op = silu_mul_quant 2025-05-07T20:32:39.0722352Z if compiled: 2025-05-07T20:32:39.0722595Z op = torch.compile(op) 2025-05-07T20:32:39.0722897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.0723176Z 2025-05-07T20:32:39.0723367Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.0723542Z 2025-05-07T20:32:39.0723640Z moe/activation_test.py:117: 2025-05-07T20:32:39.0723941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.0724267Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.0724556Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.0725300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.0726000Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.0726534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.0727211Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.0727872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.0728406Z kernel = self.compile( 2025-05-07T20:32:39.0728937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.0729590Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.0729989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.0730217Z 2025-05-07T20:32:39.0730428Z self = 2025-05-07T20:32:39.0731520Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.0732904Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8962280>} 2025-05-07T20:32:39.0734252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.0735327Z context = 2025-05-07T20:32:39.0735618Z 2025-05-07T20:32:39.0735784Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.0736313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.0736795Z module_map=module_map) 2025-05-07T20:32:39.0737175Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.0737526Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.0737794Z E ^ 2025-05-07T20:32:39.0738274Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.0738774Z 2025-05-07T20:32:39.0739186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.0739702Z 2025-05-07T20:32:39.0739850Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.0740272Z self=, 2025-05-07T20:32:39.0740679Z T=16384, 2025-05-07T20:32:39.0740867Z D=7168, 2025-05-07T20:32:39.0741070Z scale_ub=1200.0, 2025-05-07T20:32:39.0741296Z contiguous=False, 2025-05-07T20:32:39.0741561Z compiled=True, 2025-05-07T20:32:39.0741770Z ) 2025-05-07T20:32:39.2661769Z self = 2025-05-07T20:32:39.2663082Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:39.2663646Z 2025-05-07T20:32:39.2663811Z @given( 2025-05-07T20:32:39.2664300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.2664923Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.2665290Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.2665638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.2665976Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.2666266Z ) 2025-05-07T20:32:39.2666618Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.2667057Z def test_silu_mul_quant( 2025-05-07T20:32:39.2667303Z self, 2025-05-07T20:32:39.2667745Z T: int, 2025-05-07T20:32:39.2667946Z D: int, 2025-05-07T20:32:39.2668176Z scale_ub: Optional[float], 2025-05-07T20:32:39.2668454Z contiguous: bool, 2025-05-07T20:32:39.2668691Z compiled: bool, 2025-05-07T20:32:39.2668920Z ) -> None: 2025-05-07T20:32:39.2669143Z torch.manual_seed(2025) 2025-05-07T20:32:39.2669383Z 2025-05-07T20:32:39.2669670Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.2670120Z 2025-05-07T20:32:39.2670313Z x_sign = torch.sign(x) 2025-05-07T20:32:39.2670608Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.2670926Z x = x_sign * x_clamp 2025-05-07T20:32:39.2671167Z x0 = x[:, :D] 2025-05-07T20:32:39.2671390Z x1 = x[:, D:] 2025-05-07T20:32:39.2671608Z 2025-05-07T20:32:39.2671792Z if contiguous: 2025-05-07T20:32:39.2672045Z x0 = x0.contiguous() 2025-05-07T20:32:39.2672315Z x1 = x1.contiguous() 2025-05-07T20:32:39.2672569Z 2025-05-07T20:32:39.2672766Z if scale_ub is not None: 2025-05-07T20:32:39.2681110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.2681506Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.2681828Z ) 2025-05-07T20:32:39.2682019Z else: 2025-05-07T20:32:39.2682242Z scale_ub_tensor = None 2025-05-07T20:32:39.2682512Z 2025-05-07T20:32:39.2682746Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.2683069Z op = silu_mul_quant 2025-05-07T20:32:39.2683329Z if compiled: 2025-05-07T20:32:39.2683593Z op = torch.compile(op) 2025-05-07T20:32:39.2683897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2684184Z 2025-05-07T20:32:39.2684392Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.2684560Z 2025-05-07T20:32:39.2684665Z moe/activation_test.py:117: 2025-05-07T20:32:39.2685012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2685365Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.2685654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2686232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.2686790Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.2687602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.2688297Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.2688923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.2689618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.2690284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.2690914Z kernel = self.compile( 2025-05-07T20:32:39.2691468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.2692132Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.2692538Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2692782Z 2025-05-07T20:32:39.2692993Z self = 2025-05-07T20:32:39.2694091Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.2695543Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8962ee0>} 2025-05-07T20:32:39.2696938Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.2697974Z context = 2025-05-07T20:32:39.2698272Z 2025-05-07T20:32:39.2698445Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.2698979Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.2699450Z module_map=module_map) 2025-05-07T20:32:39.2699830Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.2700203Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.2700474Z E ^ 2025-05-07T20:32:39.2700945Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.2701414Z 2025-05-07T20:32:39.2701837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.2702351Z 2025-05-07T20:32:39.2702469Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.2702896Z self=, 2025-05-07T20:32:39.2703311Z T=1, 2025-05-07T20:32:39.2703505Z D=7168, 2025-05-07T20:32:39.2704057Z scale_ub=None, 2025-05-07T20:32:39.2704315Z contiguous=False, 2025-05-07T20:32:39.2704551Z compiled=False, 2025-05-07T20:32:39.2704770Z ) 2025-05-07T20:32:39.2705134Z self = 2025-05-07T20:32:39.2705640Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:39.2705903Z 2025-05-07T20:32:39.2705989Z @given( 2025-05-07T20:32:39.2706219Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.2706540Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.2706857Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.2707195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.2707523Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.2707817Z ) 2025-05-07T20:32:39.2708171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.2708712Z def test_silu_mul_quant( 2025-05-07T20:32:39.2708960Z self, 2025-05-07T20:32:39.2709162Z T: int, 2025-05-07T20:32:39.2709356Z D: int, 2025-05-07T20:32:39.2709576Z scale_ub: Optional[float], 2025-05-07T20:32:39.2709996Z contiguous: bool, 2025-05-07T20:32:39.2710244Z compiled: bool, 2025-05-07T20:32:39.2710465Z ) -> None: 2025-05-07T20:32:39.2710685Z torch.manual_seed(2025) 2025-05-07T20:32:39.2710931Z 2025-05-07T20:32:39.2711200Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.2711616Z 2025-05-07T20:32:39.2711814Z x_sign = torch.sign(x) 2025-05-07T20:32:39.2712106Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.2712423Z x = x_sign * x_clamp 2025-05-07T20:32:39.2712666Z x0 = x[:, :D] 2025-05-07T20:32:39.2712880Z x1 = x[:, D:] 2025-05-07T20:32:39.2713093Z 2025-05-07T20:32:39.2713286Z if contiguous: 2025-05-07T20:32:39.2713512Z x0 = x0.contiguous() 2025-05-07T20:32:39.2713779Z x1 = x1.contiguous() 2025-05-07T20:32:39.2714021Z 2025-05-07T20:32:39.2714209Z if scale_ub is not None: 2025-05-07T20:32:39.2714491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.2714850Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.2715188Z ) 2025-05-07T20:32:39.2715376Z else: 2025-05-07T20:32:39.2715586Z scale_ub_tensor = None 2025-05-07T20:32:39.2715840Z 2025-05-07T20:32:39.2716138Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.2716450Z op = silu_mul_quant 2025-05-07T20:32:39.2716697Z if compiled: 2025-05-07T20:32:39.2716933Z op = torch.compile(op) 2025-05-07T20:32:39.2717224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2717493Z 2025-05-07T20:32:39.2717681Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.2717851Z 2025-05-07T20:32:39.2717947Z moe/activation_test.py:117: 2025-05-07T20:32:39.2718239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2718561Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.2718840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2719528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.2720222Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.2720758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.2721437Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.2722093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.2722635Z kernel = self.compile( 2025-05-07T20:32:39.2723169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.2723823Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.2724216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2724439Z 2025-05-07T20:32:39.2724644Z self = 2025-05-07T20:32:39.2725783Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.2727166Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc887b670>} 2025-05-07T20:32:39.2728512Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.2729591Z context = 2025-05-07T20:32:39.2729880Z 2025-05-07T20:32:39.2730090Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.2730620Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.2731093Z module_map=module_map) 2025-05-07T20:32:39.2731515Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.2731874Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.2732138Z E ^ 2025-05-07T20:32:39.2732611Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.2733064Z 2025-05-07T20:32:39.2733479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.2734002Z 2025-05-07T20:32:39.2734110Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.2734529Z self=, 2025-05-07T20:32:39.2734939Z T=2048, 2025-05-07T20:32:39.2735129Z D=7168, 2025-05-07T20:32:39.2735316Z scale_ub=None, 2025-05-07T20:32:39.2735532Z contiguous=False, 2025-05-07T20:32:39.2735751Z compiled=True, 2025-05-07T20:32:39.2735956Z ) 2025-05-07T20:32:39.5599255Z self = 2025-05-07T20:32:39.5599821Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.5600092Z 2025-05-07T20:32:39.5600166Z @given( 2025-05-07T20:32:39.5600396Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.5600707Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.5601010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.5601341Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.5601667Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.5601941Z ) 2025-05-07T20:32:39.5602293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.5602731Z def test_silu_mul_quant( 2025-05-07T20:32:39.5602970Z self, 2025-05-07T20:32:39.5603155Z T: int, 2025-05-07T20:32:39.5603349Z D: int, 2025-05-07T20:32:39.5603566Z scale_ub: Optional[float], 2025-05-07T20:32:39.5604085Z contiguous: bool, 2025-05-07T20:32:39.5604320Z compiled: bool, 2025-05-07T20:32:39.5604543Z ) -> None: 2025-05-07T20:32:39.5604755Z torch.manual_seed(2025) 2025-05-07T20:32:39.5605000Z 2025-05-07T20:32:39.5605311Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.5605655Z 2025-05-07T20:32:39.5605858Z x_sign = torch.sign(x) 2025-05-07T20:32:39.5606151Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.5606451Z x = x_sign * x_clamp 2025-05-07T20:32:39.5606694Z x0 = x[:, :D] 2025-05-07T20:32:39.5606907Z x1 = x[:, D:] 2025-05-07T20:32:39.5607110Z 2025-05-07T20:32:39.5607294Z if contiguous: 2025-05-07T20:32:39.5607533Z x0 = x0.contiguous() 2025-05-07T20:32:39.5607783Z x1 = x1.contiguous() 2025-05-07T20:32:39.5608027Z 2025-05-07T20:32:39.5608219Z if scale_ub is not None: 2025-05-07T20:32:39.5608490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.5608823Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.5609138Z ) 2025-05-07T20:32:39.5609332Z else: 2025-05-07T20:32:39.5609545Z scale_ub_tensor = None 2025-05-07T20:32:39.5609796Z 2025-05-07T20:32:39.5610031Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.5610427Z op = silu_mul_quant 2025-05-07T20:32:39.5610684Z if compiled: 2025-05-07T20:32:39.5610927Z op = torch.compile(op) 2025-05-07T20:32:39.5611216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.5611483Z 2025-05-07T20:32:39.5611751Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.5611914Z 2025-05-07T20:32:39.5612012Z moe/activation_test.py:117: 2025-05-07T20:32:39.5612312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.5612644Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.5613045Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.5613618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.5614181Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.5614845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.5615538Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.5616077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.5616767Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.5617419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.5617947Z kernel = self.compile( 2025-05-07T20:32:39.5618545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.5619199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.5619595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.5619827Z 2025-05-07T20:32:39.5620033Z self = 2025-05-07T20:32:39.5621123Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.5622524Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc845c550>} 2025-05-07T20:32:39.5623866Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.5624900Z context = 2025-05-07T20:32:39.5625235Z 2025-05-07T20:32:39.5625406Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.5625929Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.5626394Z module_map=module_map) 2025-05-07T20:32:39.5626764Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.5627109Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.5627368Z E ^ 2025-05-07T20:32:39.5627825Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.5628283Z 2025-05-07T20:32:39.5628706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.5629221Z 2025-05-07T20:32:39.5629325Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.5629742Z self=, 2025-05-07T20:32:39.5630193Z T=4096, 2025-05-07T20:32:39.5630379Z D=7168, 2025-05-07T20:32:39.5630567Z scale_ub=None, 2025-05-07T20:32:39.5630773Z contiguous=False, 2025-05-07T20:32:39.5631067Z compiled=True, 2025-05-07T20:32:39.5631274Z ) 2025-05-07T20:32:39.5631588Z self = 2025-05-07T20:32:39.5632080Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.5632394Z 2025-05-07T20:32:39.5632477Z @given( 2025-05-07T20:32:39.5632705Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.5633016Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.5633319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.5633699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.5634026Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.5634313Z ) 2025-05-07T20:32:39.5634666Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.5635104Z def test_silu_mul_quant( 2025-05-07T20:32:39.5635376Z self, 2025-05-07T20:32:39.5635594Z T: int, 2025-05-07T20:32:39.5635780Z D: int, 2025-05-07T20:32:39.5635995Z scale_ub: Optional[float], 2025-05-07T20:32:39.5636269Z contiguous: bool, 2025-05-07T20:32:39.5636504Z compiled: bool, 2025-05-07T20:32:39.5636738Z ) -> None: 2025-05-07T20:32:39.5636957Z torch.manual_seed(2025) 2025-05-07T20:32:39.5637194Z 2025-05-07T20:32:39.5637464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.5637808Z 2025-05-07T20:32:39.5638000Z x_sign = torch.sign(x) 2025-05-07T20:32:39.5638342Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.5638664Z x = x_sign * x_clamp 2025-05-07T20:32:39.5638904Z x0 = x[:, :D] 2025-05-07T20:32:39.5639113Z x1 = x[:, D:] 2025-05-07T20:32:39.5639321Z 2025-05-07T20:32:39.5639506Z if contiguous: 2025-05-07T20:32:39.5639728Z x0 = x0.contiguous() 2025-05-07T20:32:39.5639990Z x1 = x1.contiguous() 2025-05-07T20:32:39.5640244Z 2025-05-07T20:32:39.5640431Z if scale_ub is not None: 2025-05-07T20:32:39.5640716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.5641063Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.5641368Z ) 2025-05-07T20:32:39.5641563Z else: 2025-05-07T20:32:39.5641776Z scale_ub_tensor = None 2025-05-07T20:32:39.5642016Z 2025-05-07T20:32:39.5642245Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.5642557Z op = silu_mul_quant 2025-05-07T20:32:39.5642803Z if compiled: 2025-05-07T20:32:39.5643050Z op = torch.compile(op) 2025-05-07T20:32:39.5643344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.5643623Z 2025-05-07T20:32:39.5643803Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.5643972Z 2025-05-07T20:32:39.5644073Z moe/activation_test.py:117: 2025-05-07T20:32:39.5644372Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.5644696Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.5644977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.5645548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.5646099Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.5646758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.5647444Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.5647981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.5648652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.5649313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.5649891Z kernel = self.compile( 2025-05-07T20:32:39.5650421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.5651070Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.5651504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.5651730Z 2025-05-07T20:32:39.5651940Z self = 2025-05-07T20:32:39.5653027Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.5654467Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8550160>} 2025-05-07T20:32:39.5655878Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.5656911Z context = 2025-05-07T20:32:39.5657196Z 2025-05-07T20:32:39.5657371Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.5657897Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.5658411Z module_map=module_map) 2025-05-07T20:32:39.5658787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.5659133Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.5659393Z E ^ 2025-05-07T20:32:39.5659856Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.5660313Z 2025-05-07T20:32:39.5660742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.5661258Z 2025-05-07T20:32:39.7731328Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.7732185Z self=, 2025-05-07T20:32:39.7733237Z T=16384, 2025-05-07T20:32:39.7733717Z D=5120, 2025-05-07T20:32:39.7734148Z scale_ub=1200.0, 2025-05-07T20:32:39.7734578Z contiguous=False, 2025-05-07T20:32:39.7735021Z compiled=False, 2025-05-07T20:32:39.7735354Z ) 2025-05-07T20:32:39.7735733Z self = 2025-05-07T20:32:39.7736232Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.7736525Z 2025-05-07T20:32:39.7736604Z @given( 2025-05-07T20:32:39.7736840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7737151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7737460Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7737789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7738110Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7738398Z ) 2025-05-07T20:32:39.7738748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7739180Z def test_silu_mul_quant( 2025-05-07T20:32:39.7739427Z self, 2025-05-07T20:32:39.7739624Z T: int, 2025-05-07T20:32:39.7739823Z D: int, 2025-05-07T20:32:39.7740043Z scale_ub: Optional[float], 2025-05-07T20:32:39.7740320Z contiguous: bool, 2025-05-07T20:32:39.7740557Z compiled: bool, 2025-05-07T20:32:39.7740782Z ) -> None: 2025-05-07T20:32:39.7741002Z torch.manual_seed(2025) 2025-05-07T20:32:39.7741244Z 2025-05-07T20:32:39.7741511Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.7742103Z 2025-05-07T20:32:39.7742295Z x_sign = torch.sign(x) 2025-05-07T20:32:39.7742583Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.7742894Z x = x_sign * x_clamp 2025-05-07T20:32:39.7743139Z x0 = x[:, :D] 2025-05-07T20:32:39.7743433Z x1 = x[:, D:] 2025-05-07T20:32:39.7743649Z 2025-05-07T20:32:39.7743841Z if contiguous: 2025-05-07T20:32:39.7744069Z x0 = x0.contiguous() 2025-05-07T20:32:39.7744336Z x1 = x1.contiguous() 2025-05-07T20:32:39.7744585Z 2025-05-07T20:32:39.7744777Z if scale_ub is not None: 2025-05-07T20:32:39.7745130Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.7745472Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.7745792Z ) 2025-05-07T20:32:39.7745985Z else: 2025-05-07T20:32:39.7746202Z scale_ub_tensor = None 2025-05-07T20:32:39.7746457Z 2025-05-07T20:32:39.7746690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.7747009Z op = silu_mul_quant 2025-05-07T20:32:39.7747265Z if compiled: 2025-05-07T20:32:39.7747507Z op = torch.compile(op) 2025-05-07T20:32:39.7747808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7748085Z 2025-05-07T20:32:39.7748274Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.7748444Z 2025-05-07T20:32:39.7748542Z moe/activation_test.py:117: 2025-05-07T20:32:39.7748835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7749243Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.7749525Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7750325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:39.7751020Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.7751551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.7752238Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.7752899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.7753433Z kernel = self.compile( 2025-05-07T20:32:39.7753968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.7754625Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.7755035Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7755265Z 2025-05-07T20:32:39.7755475Z self = 2025-05-07T20:32:39.7756568Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.7757977Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8550940>} 2025-05-07T20:32:39.7759336Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.7760364Z context = 2025-05-07T20:32:39.7760653Z 2025-05-07T20:32:39.7760817Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.7761340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.7761806Z module_map=module_map) 2025-05-07T20:32:39.7762227Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.7762579Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.7762841Z E ^ 2025-05-07T20:32:39.7763383Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.7763841Z 2025-05-07T20:32:39.7764259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
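[Why every example fails: fp8e4nv is Triton's name for the FP8 E4M3 format (torch.float8_e4m3fn), and its CUDA lowering requires compute capability 8.9 or newer (Ada/Hopper). The GPU on this runner (an NVIDIA A10G) reports compute capability (8, 6), so Triton offers only fp8e4b15 and fp8e5 there, and the kernel fails at compile time for every input shape, independent of T, D, scale_ub, contiguity, or torch.compile. A minimal sketch of a capability guard that would skip, rather than fail, these tests on such runners (supports_fp8e4nv is a hypothetical helper, not part of the test suite):]

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # FP8 e4m3 (Triton "fp8e4nv") needs an NVIDIA GPU with compute
        # capability >= 8.9; the A10G on g5 instances reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on a test class like the one shown above:
    @unittest.skipUnless(supports_fp8e4nv(), "FP8 e4m3 requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...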
[log trimmed: Hypothesis went on to try further examples, each reprinting the same test source and failing with the identical traceback (activation.py:80 -> jit.py:330 -> jit.py:623 -> compiler.py:273 -> make_ir). The next three:

    Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)

each ending in:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
E   The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError]
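[For reference, the op under test, silu_mul_quant(x0, x1, scale_ub), fuses SiLU(x0) * x1 with quantization to FP8 and returns the quantized tensor plus its scale. A rough eager-mode equivalent of the assumed semantics; the rowwise scaling scheme and the float8_e4m3fn target are inferences from the test above, not FBGEMM's exact kernel:]

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then dynamic rowwise FP8 quantization.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            # Bound the dynamic range used to derive the scale.
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(1)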
[log trimmed: two more examples, same failure:

    Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

E   ValueError("type fp8e4nv not supported in this architecture.
E   The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError]
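[Any one failing example can be reproduced without Hypothesis by calling the op directly. A sketch, assuming a CUDA build of fbgemm_gpu with the experimental gen_ai extensions installed; the import path matches the traceback above:]

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    # The third argument is the optional scale upper bound; passing None
    # matches the scale_ub=None examples in this log.
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)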
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.5352378Z 2025-05-07T20:32:40.5352806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.5353327Z 2025-05-07T20:32:40.5353438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.5353858Z self=, 2025-05-07T20:32:40.5354285Z T=128, 2025-05-07T20:32:40.5354478Z D=5120, 2025-05-07T20:32:40.5354676Z scale_ub=1200.0, 2025-05-07T20:32:40.5354900Z contiguous=False, 2025-05-07T20:32:40.5355132Z compiled=True, 2025-05-07T20:32:40.5355342Z ) 2025-05-07T20:32:40.6681398Z self = 2025-05-07T20:32:40.6682110Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:40.6682387Z 2025-05-07T20:32:40.6682466Z @given( 2025-05-07T20:32:40.6682702Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6683018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6683341Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6683667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6683997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6684284Z ) 2025-05-07T20:32:40.6684631Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6685074Z def test_silu_mul_quant( 2025-05-07T20:32:40.6685322Z self, 2025-05-07T20:32:40.6685545Z T: int, 2025-05-07T20:32:40.6685776Z D: int, 2025-05-07T20:32:40.6686002Z scale_ub: Optional[float], 2025-05-07T20:32:40.6686277Z contiguous: bool, 2025-05-07T20:32:40.6686525Z compiled: bool, 2025-05-07T20:32:40.6686754Z ) -> None: 2025-05-07T20:32:40.6686966Z torch.manual_seed(2025) 2025-05-07T20:32:40.6687209Z 2025-05-07T20:32:40.6687482Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6687821Z 2025-05-07T20:32:40.6688015Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6688303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6688611Z x = x_sign * x_clamp 2025-05-07T20:32:40.6688844Z x0 = x[:, :D] 2025-05-07T20:32:40.6689061Z x1 = x[:, D:] 2025-05-07T20:32:40.6689268Z 2025-05-07T20:32:40.6689447Z if contiguous: 2025-05-07T20:32:40.6689678Z x0 = x0.contiguous() 2025-05-07T20:32:40.6689943Z x1 = x1.contiguous() 2025-05-07T20:32:40.6690176Z 2025-05-07T20:32:40.6690368Z if scale_ub is not None: 2025-05-07T20:32:40.6690649Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6690978Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6691291Z ) 2025-05-07T20:32:40.6691486Z else: 2025-05-07T20:32:40.6691689Z scale_ub_tensor = None 2025-05-07T20:32:40.6691943Z 2025-05-07T20:32:40.6692176Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6692578Z op = silu_mul_quant 2025-05-07T20:32:40.6692827Z if compiled: 2025-05-07T20:32:40.6693076Z op = torch.compile(op) 2025-05-07T20:32:40.6693375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6693727Z 2025-05-07T20:32:40.6693919Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6694083Z 2025-05-07T20:32:40.6694188Z moe/activation_test.py:117: 2025-05-07T20:32:40.6694480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6694820Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6695179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6695734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6696307Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6696979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6697684Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6698225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6698923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6699592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6700124Z kernel = self.compile( 2025-05-07T20:32:40.6700722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6701385Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6701787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6702019Z 2025-05-07T20:32:40.6702225Z self = 2025-05-07T20:32:40.6703330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6705119Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc81370d0>} 2025-05-07T20:32:40.6706827Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6707994Z context = 2025-05-07T20:32:40.6708284Z 2025-05-07T20:32:40.6708455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6708987Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6709470Z module_map=module_map) 2025-05-07T20:32:40.6709886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6710247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6710518Z E ^ 2025-05-07T20:32:40.6710987Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6711439Z 2025-05-07T20:32:40.6711860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6712380Z 2025-05-07T20:32:40.6712484Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6712898Z self=, 2025-05-07T20:32:40.6713313Z T=16384, 2025-05-07T20:32:40.6713499Z D=7168, 2025-05-07T20:32:40.6713700Z scale_ub=1200.0, 2025-05-07T20:32:40.6714003Z contiguous=True, 2025-05-07T20:32:40.6714223Z compiled=True, 2025-05-07T20:32:40.6714433Z ) 2025-05-07T20:32:40.6714753Z self = 2025-05-07T20:32:40.6715432Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.6715785Z 2025-05-07T20:32:40.6715881Z @given( 2025-05-07T20:32:40.6716171Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6716553Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6716934Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6717365Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6717699Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6717982Z ) 2025-05-07T20:32:40.6718332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6718776Z def test_silu_mul_quant( 2025-05-07T20:32:40.6719022Z self, 2025-05-07T20:32:40.6719221Z T: int, 2025-05-07T20:32:40.6719423Z D: int, 2025-05-07T20:32:40.6719638Z scale_ub: Optional[float], 2025-05-07T20:32:40.6719909Z contiguous: bool, 2025-05-07T20:32:40.6720158Z compiled: bool, 2025-05-07T20:32:40.6720382Z ) -> None: 2025-05-07T20:32:40.6720609Z torch.manual_seed(2025) 2025-05-07T20:32:40.6720852Z 2025-05-07T20:32:40.6721123Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6721477Z 2025-05-07T20:32:40.6721676Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6722033Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6722352Z x = x_sign * x_clamp 2025-05-07T20:32:40.6722605Z x0 = x[:, :D] 2025-05-07T20:32:40.6722830Z x1 = x[:, D:] 2025-05-07T20:32:40.6723041Z 2025-05-07T20:32:40.6723233Z if contiguous: 2025-05-07T20:32:40.6723476Z x0 = x0.contiguous() 2025-05-07T20:32:40.6723740Z x1 = x1.contiguous() 2025-05-07T20:32:40.6723984Z 2025-05-07T20:32:40.6724180Z if scale_ub is not None: 2025-05-07T20:32:40.6724455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6724801Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6725124Z ) 2025-05-07T20:32:40.6725314Z else: 2025-05-07T20:32:40.6725534Z scale_ub_tensor = None 2025-05-07T20:32:40.6725789Z 2025-05-07T20:32:40.6726016Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6726336Z op = silu_mul_quant 2025-05-07T20:32:40.6726600Z if compiled: 2025-05-07T20:32:40.6726847Z op = torch.compile(op) 2025-05-07T20:32:40.6727151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6727431Z 2025-05-07T20:32:40.6727619Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6727794Z 2025-05-07T20:32:40.6727895Z moe/activation_test.py:117: 2025-05-07T20:32:40.6728197Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6728537Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6728827Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6729389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6729952Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6730616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6731314Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6731847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6732530Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6733192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6733775Z kernel = self.compile( 2025-05-07T20:32:40.6734316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6735045Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6735494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6735722Z 2025-05-07T20:32:40.6735932Z self = 2025-05-07T20:32:40.6737026Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6738461Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8137d30>} 2025-05-07T20:32:40.6739823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6740861Z context = 2025-05-07T20:32:40.6741153Z 2025-05-07T20:32:40.6741318Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6741856Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6742376Z module_map=module_map) 2025-05-07T20:32:40.6742738Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6743096Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6743353Z E ^ 2025-05-07T20:32:40.6743822Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6744278Z 2025-05-07T20:32:40.6744694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6745212Z 2025-05-07T20:32:40.9495372Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9496054Z self=, 2025-05-07T20:32:40.9496611Z T=16384, 2025-05-07T20:32:40.9496834Z D=5120, 2025-05-07T20:32:40.9497029Z scale_ub=1200.0, 2025-05-07T20:32:40.9497254Z contiguous=True, 2025-05-07T20:32:40.9497473Z compiled=False, 2025-05-07T20:32:40.9497692Z ) 2025-05-07T20:32:40.9498011Z self = 2025-05-07T20:32:40.9498502Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:40.9498786Z 2025-05-07T20:32:40.9498871Z @given( 2025-05-07T20:32:40.9499100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.9499415Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.9499722Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.9500051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.9500386Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.9500673Z ) 2025-05-07T20:32:40.9501020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.9501464Z def test_silu_mul_quant( 2025-05-07T20:32:40.9501708Z self, 2025-05-07T20:32:40.9501899Z T: int, 2025-05-07T20:32:40.9502100Z D: int, 2025-05-07T20:32:40.9502314Z scale_ub: Optional[float], 2025-05-07T20:32:40.9502587Z contiguous: bool, 2025-05-07T20:32:40.9502823Z compiled: bool, 2025-05-07T20:32:40.9503047Z ) -> None: 2025-05-07T20:32:40.9503262Z torch.manual_seed(2025) 2025-05-07T20:32:40.9503510Z 2025-05-07T20:32:40.9504169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.9504737Z 2025-05-07T20:32:40.9504930Z x_sign = torch.sign(x) 2025-05-07T20:32:40.9505216Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.9505552Z x = x_sign * x_clamp 2025-05-07T20:32:40.9505895Z x0 = x[:, :D] 2025-05-07T20:32:40.9506114Z x1 = x[:, D:] 2025-05-07T20:32:40.9506312Z 2025-05-07T20:32:40.9506498Z if contiguous: 2025-05-07T20:32:40.9506733Z x0 = x0.contiguous() 2025-05-07T20:32:40.9506988Z x1 = x1.contiguous() 2025-05-07T20:32:40.9507228Z 2025-05-07T20:32:40.9507522Z if scale_ub is not None: 2025-05-07T20:32:40.9507789Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.9508129Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.9508438Z ) 2025-05-07T20:32:40.9508618Z else: 2025-05-07T20:32:40.9508830Z scale_ub_tensor = None 2025-05-07T20:32:40.9509080Z 2025-05-07T20:32:40.9509309Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.9509625Z op = silu_mul_quant 2025-05-07T20:32:40.9509952Z if compiled: 2025-05-07T20:32:40.9510193Z op = torch.compile(op) 2025-05-07T20:32:40.9510496Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9510772Z 2025-05-07T20:32:40.9510965Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.9511127Z 2025-05-07T20:32:40.9511226Z moe/activation_test.py:117: 2025-05-07T20:32:40.9511521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9511930Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.9512211Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9513078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.9513776Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.9514310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.9514989Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.9515653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.9516179Z kernel = self.compile( 2025-05-07T20:32:40.9516712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.9517361Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9517758Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9517982Z 2025-05-07T20:32:40.9518199Z self = 2025-05-07T20:32:40.9519272Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.9520668Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc831d700>} 2025-05-07T20:32:40.9522010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.9523032Z context = 2025-05-07T20:32:40.9523319Z 2025-05-07T20:32:40.9523491Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.9524010Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9524485Z module_map=module_map) 2025-05-07T20:32:40.9524909Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9525282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9525565Z E ^ 2025-05-07T20:32:40.9526446Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9526897Z 2025-05-07T20:32:40.9527320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.9527829Z 2025-05-07T20:32:40.9527931Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9528393Z self=, 2025-05-07T20:32:40.9528796Z T=1, 2025-05-07T20:32:40.9528977Z D=7168, 2025-05-07T20:32:40.9529173Z scale_ub=1200.0, 2025-05-07T20:32:40.9529400Z contiguous=False, 2025-05-07T20:32:40.9529622Z compiled=False, 2025-05-07T20:32:40.9529827Z ) 2025-05-07T20:32:40.9530147Z self = 2025-05-07T20:32:40.9530630Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.9530902Z 2025-05-07T20:32:40.9530974Z @given( 2025-05-07T20:32:40.9531208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.9531516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.9531816Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.9532152Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.9532480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.9532814Z ) 2025-05-07T20:32:40.9533161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.9533599Z def test_silu_mul_quant( 2025-05-07T20:32:40.9533831Z self, 2025-05-07T20:32:40.9534022Z T: int, 2025-05-07T20:32:40.9534220Z D: int, 2025-05-07T20:32:40.9534432Z scale_ub: Optional[float], 2025-05-07T20:32:40.9534707Z contiguous: bool, 2025-05-07T20:32:40.9534943Z compiled: bool, 2025-05-07T20:32:40.9535168Z ) -> None: 2025-05-07T20:32:40.9535377Z torch.manual_seed(2025) 2025-05-07T20:32:40.9535646Z 2025-05-07T20:32:40.9535950Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.9536283Z 2025-05-07T20:32:40.9536475Z x_sign = torch.sign(x) 2025-05-07T20:32:40.9536765Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.9537072Z x = x_sign * x_clamp 2025-05-07T20:32:40.9537318Z x0 = x[:, :D] 2025-05-07T20:32:40.9537535Z x1 = x[:, D:] 2025-05-07T20:32:40.9537733Z 2025-05-07T20:32:40.9537917Z if contiguous: 2025-05-07T20:32:40.9538146Z x0 = x0.contiguous() 2025-05-07T20:32:40.9538400Z x1 = x1.contiguous() 2025-05-07T20:32:40.9538642Z 2025-05-07T20:32:40.9538835Z if scale_ub is not None: 2025-05-07T20:32:40.9539109Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.9539448Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.9539762Z ) 2025-05-07T20:32:40.9539957Z else: 2025-05-07T20:32:40.9540167Z scale_ub_tensor = None 2025-05-07T20:32:40.9540418Z 2025-05-07T20:32:40.9540648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.9540957Z op = silu_mul_quant 2025-05-07T20:32:40.9541210Z if compiled: 2025-05-07T20:32:40.9541460Z op = torch.compile(op) 2025-05-07T20:32:40.9541755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9542035Z 2025-05-07T20:32:40.9542227Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.9542389Z 2025-05-07T20:32:40.9542488Z moe/activation_test.py:117: 2025-05-07T20:32:40.9542784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9543115Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.9543460Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9544140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.9544868Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.9545412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.9546091Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.9546748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.9547319Z kernel = self.compile( 2025-05-07T20:32:40.9547857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.9548500Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9548898Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9549128Z 2025-05-07T20:32:40.9549336Z self = 2025-05-07T20:32:40.9550480Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.9551899Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc82540d0>} 2025-05-07T20:32:40.9553259Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.9554287Z context = 2025-05-07T20:32:40.9554577Z 2025-05-07T20:32:40.9554754Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.9555279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9555741Z module_map=module_map) 2025-05-07T20:32:40.9563328Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9563736Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9564011Z E ^ 2025-05-07T20:32:40.9564493Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9564967Z 2025-05-07T20:32:40.9565404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.9565932Z 2025-05-07T20:32:40.9566039Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9566468Z self=, 2025-05-07T20:32:40.9566879Z T=4096, 2025-05-07T20:32:40.9567075Z D=7168, 2025-05-07T20:32:40.9567277Z scale_ub=1200.0, 2025-05-07T20:32:40.9567505Z contiguous=False, 2025-05-07T20:32:40.9567738Z compiled=True, 2025-05-07T20:32:40.9567956Z ) 2025-05-07T20:32:41.0731313Z self = 2025-05-07T20:32:41.0732093Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:41.0732469Z 2025-05-07T20:32:41.0732573Z @given( 2025-05-07T20:32:41.0732909Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0733333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0733737Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0734160Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0734543Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0734834Z ) 2025-05-07T20:32:41.0735431Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0735886Z def test_silu_mul_quant( 2025-05-07T20:32:41.0736135Z self, 2025-05-07T20:32:41.0736329Z T: int, 2025-05-07T20:32:41.0736532Z D: int, 2025-05-07T20:32:41.0736847Z scale_ub: Optional[float], 2025-05-07T20:32:41.0737119Z contiguous: bool, 2025-05-07T20:32:41.0737368Z compiled: bool, 2025-05-07T20:32:41.0737601Z ) -> None: 2025-05-07T20:32:41.0737819Z torch.manual_seed(2025) 2025-05-07T20:32:41.0738078Z 2025-05-07T20:32:41.0738440Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0738790Z 2025-05-07T20:32:41.0738980Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0739277Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0739592Z x = x_sign * x_clamp 2025-05-07T20:32:41.0739827Z x0 = x[:, :D] 2025-05-07T20:32:41.0740052Z x1 = x[:, D:] 2025-05-07T20:32:41.0740270Z 2025-05-07T20:32:41.0740454Z if contiguous: 2025-05-07T20:32:41.0740696Z x0 = x0.contiguous() 2025-05-07T20:32:41.0740964Z x1 = x1.contiguous() 2025-05-07T20:32:41.0741206Z 2025-05-07T20:32:41.0741412Z if scale_ub is not None: 2025-05-07T20:32:41.0741698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0742039Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0742352Z ) 2025-05-07T20:32:41.0742554Z else: 2025-05-07T20:32:41.0742848Z scale_ub_tensor = None 2025-05-07T20:32:41.0743115Z 2025-05-07T20:32:41.0743356Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0743681Z op = silu_mul_quant 2025-05-07T20:32:41.0743931Z if compiled: 2025-05-07T20:32:41.0744188Z op = torch.compile(op) 2025-05-07T20:32:41.0744488Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0744759Z 2025-05-07T20:32:41.0744958Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0745126Z 2025-05-07T20:32:41.0745240Z moe/activation_test.py:117: 2025-05-07T20:32:41.0745550Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0745885Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0746187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0746758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.0747316Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.0747996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.0748694Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.0749235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.0750002Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.0750663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.0751195Z kernel = self.compile( 2025-05-07T20:32:41.0751733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.0752391Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.0752789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0753025Z 2025-05-07T20:32:41.0753242Z self = 2025-05-07T20:32:41.0754324Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.0755783Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8254dc0>} 2025-05-07T20:32:41.0757176Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.0758215Z context = 2025-05-07T20:32:41.0758501Z 2025-05-07T20:32:41.0758678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.0759241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.0759714Z module_map=module_map) 2025-05-07T20:32:41.0760086Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.0760439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.0760709Z E ^ 2025-05-07T20:32:41.0761180Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.0761632Z 2025-05-07T20:32:41.0762063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.0762577Z 2025-05-07T20:32:41.0762685Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.0763111Z self=, 2025-05-07T20:32:41.0763520Z T=128, 2025-05-07T20:32:41.0763748Z D=7168, 2025-05-07T20:32:41.0763949Z scale_ub=1200.0, 2025-05-07T20:32:41.0764182Z contiguous=False, 2025-05-07T20:32:41.0764406Z compiled=True, 2025-05-07T20:32:41.0764619Z ) 2025-05-07T20:32:41.0764947Z self = 2025-05-07T20:32:41.0765442Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:41.0765720Z 2025-05-07T20:32:41.0765797Z @given( 2025-05-07T20:32:41.0766034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0766351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0766662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0767004Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0767334Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0767618Z ) 2025-05-07T20:32:41.0767977Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0768424Z def test_silu_mul_quant( 2025-05-07T20:32:41.0768662Z self, 2025-05-07T20:32:41.0768866Z T: int, 2025-05-07T20:32:41.0769072Z D: int, 2025-05-07T20:32:41.0769308Z scale_ub: Optional[float], 2025-05-07T20:32:41.0769584Z contiguous: bool, 2025-05-07T20:32:41.0769833Z compiled: bool, 2025-05-07T20:32:41.0770072Z ) -> None: 2025-05-07T20:32:41.0770291Z torch.manual_seed(2025) 2025-05-07T20:32:41.0770544Z 2025-05-07T20:32:41.0770824Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0771168Z 2025-05-07T20:32:41.0771379Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0771680Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0771993Z x = x_sign * x_clamp 2025-05-07T20:32:41.0772245Z x0 = x[:, :D] 2025-05-07T20:32:41.0772475Z x1 = x[:, D:] 2025-05-07T20:32:41.0772690Z 2025-05-07T20:32:41.0772892Z if contiguous: 2025-05-07T20:32:41.0773132Z x0 = x0.contiguous() 2025-05-07T20:32:41.0773398Z x1 = x1.contiguous() 2025-05-07T20:32:41.0773646Z 2025-05-07T20:32:41.0773847Z if scale_ub is not None: 2025-05-07T20:32:41.0774120Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0774464Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0774828Z ) 2025-05-07T20:32:41.0775021Z else: 2025-05-07T20:32:41.0775231Z scale_ub_tensor = None 2025-05-07T20:32:41.0775489Z 2025-05-07T20:32:41.0775722Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0776070Z op = silu_mul_quant 2025-05-07T20:32:41.0776323Z if compiled: 2025-05-07T20:32:41.0776575Z op = torch.compile(op) 2025-05-07T20:32:41.0776874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0777151Z 2025-05-07T20:32:41.0777352Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0777564Z 2025-05-07T20:32:41.0777666Z moe/activation_test.py:117: 2025-05-07T20:32:41.0777967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0778299Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0778585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0779136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.0779695Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[Triton compilation traceback identical to the previous example]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[same CompilationError: type fp8e4nv not supported in this architecture; supported fp8 dtypes are ('fp8e4b15', 'fp8e5')]
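Every CompilationError in this run has the same root cause: fp8e4nv is Triton's name for the float8_e4m3fn dtype, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) onward. This job runs on linux.g5.4xlarge, whose A10G reports capability (8, 6), so Triton can offer only fp8e5 and fp8e4b15, and the kernel fails to compile for every example regardless of T, D, or the other parameters. A minimal guard along these lines (names hypothetical, not part of the test suite) would skip the test on such hardware instead of failing it:

import unittest

import torch

def _supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) needs compute capability >= 8.9;
    # the A10G behind linux.g5.4xlarge reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical usage on the failing test:
# @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
# def test_silu_mul_quant(self, ...): ...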
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (28.44 MiB free; 21.61 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated).
moe/activation_test.py:95: OutOfMemoryError
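Note how the reported free memory shrinks across examples (140.44 MiB, then 28.44 MiB) while PyTorch's allocation stays around 21.6 GiB of the A10G's 22.07 GiB: tensors from earlier examples, or from earlier tests in the same process, are still holding the allocator's memory when the next example starts. One common mitigation is to release cached blocks between examples; this is a sketch under that assumption, not something the suite is known to do:

import gc

import torch

def _release_cuda_memory() -> None:
    # Drop dead Python references first so their CUDA blocks become
    # reclaimable, then hand the cached blocks back to the driver.
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()

# Hypothetical hook: call between Hypothesis examples, e.g. from tearDown().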
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (140.44 MiB free; 21.50 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated).
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated).
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[same CompilationError: type fp8e4nv not supported in this architecture; supported fp8 dtypes are ('fp8e4b15', 'fp8e5')]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
[same CompilationError: type fp8e4nv not supported in this architecture]

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
[same CompilationError: type fp8e4nv not supported in this architecture]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free; 21.69 GiB allocated by PyTorch, 59.18 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError
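The "Tried to allocate" sizes line up exactly with the bfloat16 input tensor the test creates, which confirms the failures happen on the very first allocations of each example rather than somewhere inside the kernel:

# x = torch.randn([T, 2 * D], dtype=torch.bfloat16) costs 2 bytes per element:
def x_size_mib(T: int, D: int) -> float:
    return T * (2 * D) * 2 / 2**20

print(x_size_mib(16384, 7168))  # 448.0 -> "Tried to allocate 448.00 MiB"
print(x_size_mib(16384, 5120))  # 320.0 -> "Tried to allocate 320.00 MiB"
print(x_size_mib(2048, 7168))   # 56.0  -> "Tried to allocate 56.00 MiB"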
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[same CompilationError: type fp8e4nv not supported in this architecture]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).
moe/activation_test.py:94: OutOfMemoryError
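The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. With only 19 to 141 MiB reported as "reserved but unallocated" here, fragmentation is unlikely to be the main problem, but the setting is cheap to try. It must be in place before the first CUDA allocation, so it belongs in the job environment or at the very top of the test entry point (a sketch, assuming process-level control):

import os

# Must be set before the first CUDA allocation, i.e. before torch touches
# the GPU; exporting it in the CI job environment achieves the same thing.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402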
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.8427124Z 2025-05-07T20:32:41.8427245Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:41.8427463Z 2025-05-07T20:32:41.8427581Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.8428007Z self=, 2025-05-07T20:32:41.8428420Z T=16384, 2025-05-07T20:32:41.8428617Z D=5120, 2025-05-07T20:32:41.8428811Z scale_ub=None, 2025-05-07T20:32:41.8429020Z contiguous=True, 2025-05-07T20:32:41.8429259Z compiled=False, 2025-05-07T20:32:41.8429472Z ) 2025-05-07T20:32:41.8429787Z self = 2025-05-07T20:32:41.8430351Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.8430626Z 2025-05-07T20:32:41.8430708Z @given( 2025-05-07T20:32:41.8430931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.8431250Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.8431554Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.8431878Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.8432209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.8432494Z ) 2025-05-07T20:32:41.8432845Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.8433280Z def test_silu_mul_quant( 2025-05-07T20:32:41.8433525Z self, 2025-05-07T20:32:41.8433727Z T: int, 2025-05-07T20:32:41.8433924Z D: int, 2025-05-07T20:32:41.8434140Z scale_ub: Optional[float], 2025-05-07T20:32:41.8434421Z contiguous: bool, 2025-05-07T20:32:41.8434658Z compiled: bool, 2025-05-07T20:32:41.8434885Z ) -> None: 2025-05-07T20:32:41.8435101Z torch.manual_seed(2025) 2025-05-07T20:32:41.8435348Z 2025-05-07T20:32:41.8435671Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.8437755Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.8439662Z 2025-05-07T20:32:41.8439780Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.8439992Z 2025-05-07T20:32:41.8440103Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.8440511Z self=, 2025-05-07T20:32:41.8440918Z T=4096, 2025-05-07T20:32:41.8441120Z D=5120, 2025-05-07T20:32:41.8441309Z scale_ub=None, 2025-05-07T20:32:41.8441529Z contiguous=True, 2025-05-07T20:32:41.8441754Z compiled=False, 2025-05-07T20:32:41.8441958Z ) 2025-05-07T20:32:41.9458776Z self = 2025-05-07T20:32:41.9459549Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.9459934Z 2025-05-07T20:32:41.9460036Z @given( 2025-05-07T20:32:41.9460281Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9460745Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9461077Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9461423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9461762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9462056Z ) 2025-05-07T20:32:41.9462418Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9462877Z def test_silu_mul_quant( 2025-05-07T20:32:41.9463124Z self, 2025-05-07T20:32:41.9463328Z T: int, 2025-05-07T20:32:41.9463535Z D: int, 2025-05-07T20:32:41.9463756Z scale_ub: Optional[float], 2025-05-07T20:32:41.9464042Z contiguous: bool, 2025-05-07T20:32:41.9464291Z compiled: bool, 2025-05-07T20:32:41.9464523Z ) -> None: 2025-05-07T20:32:41.9464750Z torch.manual_seed(2025) 2025-05-07T20:32:41.9465001Z 2025-05-07T20:32:41.9465282Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9467430Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9469331Z 2025-05-07T20:32:41.9469457Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9469686Z 2025-05-07T20:32:41.9469793Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9470312Z self=, 2025-05-07T20:32:41.9470714Z T=2048, 2025-05-07T20:32:41.9470911Z D=5120, 2025-05-07T20:32:41.9471107Z scale_ub=None, 2025-05-07T20:32:41.9471328Z contiguous=False, 2025-05-07T20:32:41.9471563Z compiled=False, 2025-05-07T20:32:41.9471777Z ) 2025-05-07T20:32:41.9472098Z self = 2025-05-07T20:32:41.9472597Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:41.9472879Z 2025-05-07T20:32:41.9473039Z @given( 2025-05-07T20:32:41.9473275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9473587Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9473906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9474314Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9474649Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9474949Z ) 2025-05-07T20:32:41.9475307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9475772Z def test_silu_mul_quant( 2025-05-07T20:32:41.9476124Z self, 2025-05-07T20:32:41.9476325Z T: int, 2025-05-07T20:32:41.9476530Z D: int, 2025-05-07T20:32:41.9476749Z scale_ub: Optional[float], 2025-05-07T20:32:41.9477031Z contiguous: bool, 2025-05-07T20:32:41.9477280Z compiled: bool, 2025-05-07T20:32:41.9477505Z ) -> None: 2025-05-07T20:32:41.9477731Z torch.manual_seed(2025) 2025-05-07T20:32:41.9477991Z 2025-05-07T20:32:41.9478262Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9480416Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9482305Z 2025-05-07T20:32:41.9482428Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9482648Z 2025-05-07T20:32:41.9482755Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9483185Z self=, 2025-05-07T20:32:41.9483591Z T=4096, 2025-05-07T20:32:41.9483792Z D=7168, 2025-05-07T20:32:41.9484000Z scale_ub=None, 2025-05-07T20:32:41.9484216Z contiguous=True, 2025-05-07T20:32:41.9484447Z compiled=True, 2025-05-07T20:32:41.9484659Z ) 2025-05-07T20:32:41.9484984Z self = 2025-05-07T20:32:41.9485479Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:41.9485747Z 2025-05-07T20:32:41.9485834Z @given( 2025-05-07T20:32:41.9486088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9486410Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9486728Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9487060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9487406Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9487703Z ) 2025-05-07T20:32:41.9488058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9488507Z def test_silu_mul_quant( 2025-05-07T20:32:41.9488760Z self, 2025-05-07T20:32:41.9488963Z T: int, 2025-05-07T20:32:41.9489174Z D: int, 2025-05-07T20:32:41.9489411Z scale_ub: Optional[float], 2025-05-07T20:32:41.9489684Z contiguous: bool, 2025-05-07T20:32:41.9489935Z compiled: bool, 2025-05-07T20:32:41.9490179Z ) -> None: 2025-05-07T20:32:41.9490401Z torch.manual_seed(2025) 2025-05-07T20:32:41.9490657Z 2025-05-07T20:32:41.9490943Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9493020Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9495007Z 2025-05-07T20:32:41.9495136Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9495351Z 2025-05-07T20:32:41.9495456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9495881Z self=, 2025-05-07T20:32:41.9496288Z T=2048, 2025-05-07T20:32:41.9496524Z D=5120, 2025-05-07T20:32:41.9496722Z scale_ub=1200.0, 2025-05-07T20:32:41.9496954Z contiguous=False, 2025-05-07T20:32:41.9497184Z compiled=False, 2025-05-07T20:32:41.9497397Z ) 2025-05-07T20:32:41.9497719Z self = 2025-05-07T20:32:41.9498218Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:41.9498503Z 2025-05-07T20:32:41.9498585Z @given( 2025-05-07T20:32:41.9498823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9499144Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9499458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9499803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9500145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9500434Z ) 2025-05-07T20:32:41.9500789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9501290Z def test_silu_mul_quant( 2025-05-07T20:32:41.9501551Z self, 2025-05-07T20:32:41.9501746Z T: int, 2025-05-07T20:32:41.9501950Z D: int, 2025-05-07T20:32:41.9502180Z scale_ub: Optional[float], 2025-05-07T20:32:41.9502458Z contiguous: bool, 2025-05-07T20:32:41.9502711Z compiled: bool, 2025-05-07T20:32:41.9502947Z ) -> None: 2025-05-07T20:32:41.9503169Z torch.manual_seed(2025) 2025-05-07T20:32:41.9503429Z 2025-05-07T20:32:41.9503971Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9506115Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9507975Z 2025-05-07T20:32:41.9508098Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9508324Z 2025-05-07T20:32:41.9508431Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9508862Z self=, 2025-05-07T20:32:41.9509282Z T=4096, 2025-05-07T20:32:41.9509473Z D=7168, 2025-05-07T20:32:41.9509675Z scale_ub=1200.0, 2025-05-07T20:32:41.9509971Z contiguous=True, 2025-05-07T20:32:41.9510198Z compiled=False, 2025-05-07T20:32:41.9510412Z ) 2025-05-07T20:32:41.9510734Z self = 2025-05-07T20:32:41.9511224Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.9511510Z 2025-05-07T20:32:41.9511595Z @given( 2025-05-07T20:32:41.9511834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9512150Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9512472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9512815Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9513155Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9513520Z ) 2025-05-07T20:32:41.9513875Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9514331Z def test_silu_mul_quant( 2025-05-07T20:32:41.9514577Z self, 2025-05-07T20:32:41.9514849Z T: int, 2025-05-07T20:32:41.9515063Z D: int, 2025-05-07T20:32:41.9515283Z scale_ub: Optional[float], 2025-05-07T20:32:41.9515564Z contiguous: bool, 2025-05-07T20:32:41.9515811Z compiled: bool, 2025-05-07T20:32:41.9516037Z ) -> None: 2025-05-07T20:32:41.9516265Z torch.manual_seed(2025) 2025-05-07T20:32:41.9516580Z 2025-05-07T20:32:41.9516856Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9518936Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9520834Z 2025-05-07T20:32:41.9520956Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9521175Z 2025-05-07T20:32:41.9521283Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9521761Z self=, 2025-05-07T20:32:41.9522170Z T=16384, 2025-05-07T20:32:41.9522375Z D=7168, 2025-05-07T20:32:41.9522578Z scale_ub=None, 2025-05-07T20:32:41.9522794Z contiguous=False, 2025-05-07T20:32:41.9523034Z compiled=True, 2025-05-07T20:32:41.9523247Z ) 2025-05-07T20:32:42.0820331Z self = 2025-05-07T20:32:42.0821104Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.0821490Z 2025-05-07T20:32:42.0821583Z @given( 2025-05-07T20:32:42.0821819Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0822150Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0822464Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0822796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0823138Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0823432Z ) 2025-05-07T20:32:42.0823793Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0824246Z def test_silu_mul_quant( 2025-05-07T20:32:42.0824497Z self, 2025-05-07T20:32:42.0824695Z T: int, 2025-05-07T20:32:42.0824903Z D: int, 2025-05-07T20:32:42.0825133Z scale_ub: Optional[float], 2025-05-07T20:32:42.0825410Z contiguous: bool, 2025-05-07T20:32:42.0825711Z compiled: bool, 2025-05-07T20:32:42.0825948Z ) -> None: 2025-05-07T20:32:42.0826173Z torch.manual_seed(2025) 2025-05-07T20:32:42.0826419Z 2025-05-07T20:32:42.0826702Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0828784Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
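The failed allocation sizes line up exactly with the first tensor the test creates: x has shape [T, 2 * D] in bfloat16, i.e. 2 bytes per element. A quick check against the examples above:

    >>> def alloc_mib(T: int, D: int) -> float:
    ...     return T * (2 * D) * 2 / 2**20  # rows * cols * bytes per bf16 element
    ...
    >>> alloc_mib(4096, 7168), alloc_mib(2048, 5120), alloc_mib(16384, 7168)
    (112.0, 40.0, 448.0)

So each of these examples dies on its very first line of GPU work (moe/activation_test.py:92), consistent with the device already being at capacity when the example starts.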
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.0830757Z 2025-05-07T20:32:42.0830885Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.0831323Z 2025-05-07T20:32:42.0831430Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0831852Z self=, 2025-05-07T20:32:42.0832265Z T=4096, 2025-05-07T20:32:42.0832463Z D=7168, 2025-05-07T20:32:42.0832729Z scale_ub=None, 2025-05-07T20:32:42.0832955Z contiguous=True, 2025-05-07T20:32:42.0833187Z compiled=False, 2025-05-07T20:32:42.0833401Z ) 2025-05-07T20:32:42.0833721Z self = 2025-05-07T20:32:42.0834220Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.0834572Z 2025-05-07T20:32:42.0834653Z @given( 2025-05-07T20:32:42.0834889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0835209Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0835522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0835857Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0836205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0836501Z ) 2025-05-07T20:32:42.0836850Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0837300Z def test_silu_mul_quant( 2025-05-07T20:32:42.0837552Z self, 2025-05-07T20:32:42.0837751Z T: int, 2025-05-07T20:32:42.0837957Z D: int, 2025-05-07T20:32:42.0838188Z scale_ub: Optional[float], 2025-05-07T20:32:42.0838463Z contiguous: bool, 2025-05-07T20:32:42.0838710Z compiled: bool, 2025-05-07T20:32:42.0839066Z ) -> None: 2025-05-07T20:32:42.0839286Z torch.manual_seed(2025) 2025-05-07T20:32:42.0839542Z 2025-05-07T20:32:42.0839828Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0841914Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.0843805Z 2025-05-07T20:32:42.0843935Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.0844152Z 2025-05-07T20:32:42.0844258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0844689Z self=, 2025-05-07T20:32:42.0845106Z T=16384, 2025-05-07T20:32:42.0845305Z D=7168, 2025-05-07T20:32:42.0845512Z scale_ub=None, 2025-05-07T20:32:42.0845735Z contiguous=True, 2025-05-07T20:32:42.0845966Z compiled=False, 2025-05-07T20:32:42.0846183Z ) 2025-05-07T20:32:42.0846507Z self = 2025-05-07T20:32:42.0846999Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.0847282Z 2025-05-07T20:32:42.0847364Z @given( 2025-05-07T20:32:42.0847603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0847921Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0848229Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0848563Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0848910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0849199Z ) 2025-05-07T20:32:42.0849555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0850005Z def test_silu_mul_quant( 2025-05-07T20:32:42.0850255Z self, 2025-05-07T20:32:42.0850460Z T: int, 2025-05-07T20:32:42.0850668Z D: int, 2025-05-07T20:32:42.0850946Z scale_ub: Optional[float], 2025-05-07T20:32:42.0851224Z contiguous: bool, 2025-05-07T20:32:42.0851474Z compiled: bool, 2025-05-07T20:32:42.0851704Z ) -> None: 2025-05-07T20:32:42.0851921Z torch.manual_seed(2025) 2025-05-07T20:32:42.0852174Z 2025-05-07T20:32:42.0852495Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0854561Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.0856534Z 2025-05-07T20:32:42.0856658Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.0856881Z 2025-05-07T20:32:42.0856987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0857409Z self=, 2025-05-07T20:32:42.0857820Z T=16384, 2025-05-07T20:32:42.0858015Z D=7168, 2025-05-07T20:32:42.0858214Z scale_ub=1200.0, 2025-05-07T20:32:42.0858450Z contiguous=True, 2025-05-07T20:32:42.0858673Z compiled=False, 2025-05-07T20:32:42.0858883Z ) 2025-05-07T20:32:42.0859207Z self = 2025-05-07T20:32:42.0859757Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.0860042Z 2025-05-07T20:32:42.0860121Z @given( 2025-05-07T20:32:42.0860362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0860691Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0861010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0861356Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0861701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0861990Z ) 2025-05-07T20:32:42.0870210Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0870672Z def test_silu_mul_quant( 2025-05-07T20:32:42.0870918Z self, 2025-05-07T20:32:42.0871126Z T: int, 2025-05-07T20:32:42.0871331Z D: int, 2025-05-07T20:32:42.0871549Z scale_ub: Optional[float], 2025-05-07T20:32:42.0871832Z contiguous: bool, 2025-05-07T20:32:42.0872080Z compiled: bool, 2025-05-07T20:32:42.0872305Z ) -> None: 2025-05-07T20:32:42.0872526Z torch.manual_seed(2025) 2025-05-07T20:32:42.0872776Z 2025-05-07T20:32:42.0873049Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0875151Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
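For readers skimming the repeated test body: when compiled=True the test wraps the op in torch.compile and otherwise calls it eagerly, so both paths exercise the same underlying kernel. A self-contained sketch of that pattern, using a stand-in function rather than the FBGEMM op (silu_mul below is hypothetical, though its math matches the test's reference: x0 * sigmoid(x0) * x1):

    import torch

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 -- the gated activation this test quantizes.
        return x0 * torch.sigmoid(x0) * x1

    op = torch.compile(silu_mul)  # compilation is deferred to the first call
    y = op(torch.randn(8, 16), torch.randn(8, 16))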
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.0877058Z 2025-05-07T20:32:42.0877179Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.0877414Z 2025-05-07T20:32:42.0877518Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0877937Z self=, 2025-05-07T20:32:42.0878338Z T=128, 2025-05-07T20:32:42.0878532Z D=5120, 2025-05-07T20:32:42.0878731Z scale_ub=1200.0, 2025-05-07T20:32:42.0878954Z contiguous=False, 2025-05-07T20:32:42.0879273Z compiled=False, 2025-05-07T20:32:42.0879482Z ) 2025-05-07T20:32:42.2500042Z self = 2025-05-07T20:32:42.2500783Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.2501432Z 2025-05-07T20:32:42.2501520Z @given( 2025-05-07T20:32:42.2501766Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2502081Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2502390Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2502737Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2503172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2503454Z ) 2025-05-07T20:32:42.2504082Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2504534Z def test_silu_mul_quant( 2025-05-07T20:32:42.2504773Z self, 2025-05-07T20:32:42.2504979Z T: int, 2025-05-07T20:32:42.2505183Z D: int, 2025-05-07T20:32:42.2505398Z scale_ub: Optional[float], 2025-05-07T20:32:42.2505676Z contiguous: bool, 2025-05-07T20:32:42.2505922Z compiled: bool, 2025-05-07T20:32:42.2506146Z ) -> None: 2025-05-07T20:32:42.2506371Z torch.manual_seed(2025) 2025-05-07T20:32:42.2506620Z 2025-05-07T20:32:42.2506888Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2507237Z 2025-05-07T20:32:42.2507436Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2507821Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2508133Z x = x_sign * x_clamp 2025-05-07T20:32:42.2508381Z x0 = x[:, :D] 2025-05-07T20:32:42.2508601Z x1 = x[:, D:] 2025-05-07T20:32:42.2508804Z 2025-05-07T20:32:42.2508994Z if contiguous: 2025-05-07T20:32:42.2509228Z x0 = x0.contiguous() 2025-05-07T20:32:42.2509485Z x1 = x1.contiguous() 2025-05-07T20:32:42.2509728Z 2025-05-07T20:32:42.2510010Z if scale_ub is not None: 2025-05-07T20:32:42.2510282Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.2510611Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.2510919Z ) 2025-05-07T20:32:42.2511115Z else: 2025-05-07T20:32:42.2511319Z scale_ub_tensor = None 2025-05-07T20:32:42.2511568Z 2025-05-07T20:32:42.2511799Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.2512108Z op = silu_mul_quant 2025-05-07T20:32:42.2512349Z if compiled: 2025-05-07T20:32:42.2512602Z op = torch.compile(op) 2025-05-07T20:32:42.2512900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2513169Z 2025-05-07T20:32:42.2513365Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.2513528Z 2025-05-07T20:32:42.2513638Z moe/activation_test.py:117: 2025-05-07T20:32:42.2513924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2514430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.2514711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2515402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.2516092Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.2516625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.2517311Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.2517964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.2518490Z kernel = self.compile( 2025-05-07T20:32:42.2519033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.2519769Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.2520153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2520385Z 2025-05-07T20:32:42.2520646Z self = 2025-05-07T20:32:42.2521729Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.2523194Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7bc5ca0>} 2025-05-07T20:32:42.2524530Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.2525553Z context = 2025-05-07T20:32:42.2525848Z 2025-05-07T20:32:42.2526016Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.2526542Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.2526999Z module_map=module_map) 2025-05-07T20:32:42.2527369Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.2527718Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.2527971Z E ^ 2025-05-07T20:32:42.2528478Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2528937Z 2025-05-07T20:32:42.2529351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2529864Z 2025-05-07T20:32:42.2529981Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2530386Z self=, 2025-05-07T20:32:42.2530783Z T=2048, 2025-05-07T20:32:42.2530974Z D=7168, 2025-05-07T20:32:42.2531158Z scale_ub=None, 2025-05-07T20:32:42.2531373Z contiguous=False, 2025-05-07T20:32:42.2531597Z compiled=False, 2025-05-07T20:32:42.2531808Z ) 2025-05-07T20:32:42.2532118Z self = 2025-05-07T20:32:42.2532608Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.2532884Z 2025-05-07T20:32:42.2532966Z @given( 2025-05-07T20:32:42.2533188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2533501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2533810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2534133Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2534463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2534747Z ) 2025-05-07T20:32:42.2535096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2535534Z def test_silu_mul_quant( 2025-05-07T20:32:42.2535780Z self, 2025-05-07T20:32:42.2535982Z T: int, 2025-05-07T20:32:42.2536172Z D: int, 2025-05-07T20:32:42.2536388Z scale_ub: Optional[float], 2025-05-07T20:32:42.2536659Z contiguous: bool, 2025-05-07T20:32:42.2536889Z compiled: bool, 2025-05-07T20:32:42.2537109Z ) -> None: 2025-05-07T20:32:42.2537327Z torch.manual_seed(2025) 2025-05-07T20:32:42.2537563Z 2025-05-07T20:32:42.2537830Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2539930Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
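This second failure mode is unrelated to the OOMs: Triton's fp8e4nv type (the NVIDIA FP8 E4M3 format) is only available on GPUs with compute capability 8.9 or newer (Ada/Hopper). The job runs on a linux.g5.4xlarge.nvidia.gpu runner, whose A10G reports SM 8.6, so the kernel is rejected at compile time and only fp8e4b15 and fp8e5 remain. A hedged guard sketch (the helper name is hypothetical; the 8.9 threshold is an assumption inferred from the error text):

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv requires SM 8.9+; an A10G reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    skip_unless_fp8 = unittest.skipUnless(
        supports_fp8_e4m3(), "FP8 E4M3 not supported on this GPU architecture"
    )

applied as a decorator, this would skip rather than fail the affected cases on pre-Ada runners.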
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.2541824Z 2025-05-07T20:32:42.2541947Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.2542156Z 2025-05-07T20:32:42.2542263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2542669Z self=, 2025-05-07T20:32:42.2543107Z T=128, 2025-05-07T20:32:42.2543288Z D=7168, 2025-05-07T20:32:42.2543470Z scale_ub=1200.0, 2025-05-07T20:32:42.2543687Z contiguous=True, 2025-05-07T20:32:42.2543904Z compiled=True, 2025-05-07T20:32:42.2544099Z ) 2025-05-07T20:32:42.2990476Z self = 2025-05-07T20:32:42.2991250Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.2991625Z 2025-05-07T20:32:42.2991727Z @given( 2025-05-07T20:32:42.2992033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2992382Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2992689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2993017Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2993341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2993624Z ) 2025-05-07T20:32:42.2994143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2994586Z def test_silu_mul_quant( 2025-05-07T20:32:42.2994822Z self, 2025-05-07T20:32:42.2995020Z T: int, 2025-05-07T20:32:42.2995216Z D: int, 2025-05-07T20:32:42.2995432Z scale_ub: Optional[float], 2025-05-07T20:32:42.2995761Z contiguous: bool, 2025-05-07T20:32:42.2996003Z compiled: bool, 2025-05-07T20:32:42.2996227Z ) -> None: 2025-05-07T20:32:42.2996448Z torch.manual_seed(2025) 2025-05-07T20:32:42.2996691Z 2025-05-07T20:32:42.2996961Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2997306Z 2025-05-07T20:32:42.2997508Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2997803Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2998107Z x = x_sign * x_clamp 2025-05-07T20:32:42.2998354Z x0 = x[:, :D] 2025-05-07T20:32:42.2998579Z x1 = x[:, D:] 2025-05-07T20:32:42.2998778Z 2025-05-07T20:32:42.2998964Z if contiguous: 2025-05-07T20:32:42.2999194Z x0 = x0.contiguous() 2025-05-07T20:32:42.2999447Z x1 = x1.contiguous() 2025-05-07T20:32:42.2999687Z 2025-05-07T20:32:42.2999876Z if scale_ub is not None: 2025-05-07T20:32:42.3000142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.3000477Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.3000784Z ) 2025-05-07T20:32:42.3000969Z else: 2025-05-07T20:32:42.3001176Z scale_ub_tensor = None 2025-05-07T20:32:42.3001425Z 2025-05-07T20:32:42.3001649Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.3001964Z op = silu_mul_quant 2025-05-07T20:32:42.3002211Z if compiled: 2025-05-07T20:32:42.3002456Z op = torch.compile(op) 2025-05-07T20:32:42.3002756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.3003031Z 2025-05-07T20:32:42.3003220Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.3003384Z 2025-05-07T20:32:42.3003508Z moe/activation_test.py:117: 2025-05-07T20:32:42.3004078Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.3004412Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.3004811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.3005366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.3005978Z return fn(*args, **kwargs) 2025-05-07T20:32:42.3006714Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.3007404Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.3007933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.3008721Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.3009384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.3009909Z kernel = self.compile( 2025-05-07T20:32:42.3010452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.3011108Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.3011506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.3011732Z 2025-05-07T20:32:42.3011942Z self = 2025-05-07T20:32:42.3013091Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.3014492Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7b390d0>} 2025-05-07T20:32:42.3015842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.3016867Z context = 2025-05-07T20:32:42.3017150Z 2025-05-07T20:32:42.3017317Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.3017837Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.3018304Z module_map=module_map) 2025-05-07T20:32:42.3018664Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.3019021Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.3019278Z E ^ 2025-05-07T20:32:42.3019740Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.3020188Z 2025-05-07T20:32:42.3020602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.3021118Z 2025-05-07T20:32:42.3021219Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3021627Z self=, 2025-05-07T20:32:42.3022029Z T=128, 2025-05-07T20:32:42.3022208Z D=7168, 2025-05-07T20:32:42.3022401Z scale_ub=1200.0, 2025-05-07T20:32:42.3022626Z contiguous=True, 2025-05-07T20:32:42.3022839Z compiled=False, 2025-05-07T20:32:42.3023047Z ) 2025-05-07T20:32:42.3023360Z self = 2025-05-07T20:32:42.3023841Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.3024118Z 2025-05-07T20:32:42.3024194Z @given( 2025-05-07T20:32:42.3024419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3024721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3025026Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3025406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3025782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3026062Z ) 2025-05-07T20:32:42.3026407Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3026886Z def test_silu_mul_quant( 2025-05-07T20:32:42.3027120Z self, 2025-05-07T20:32:42.3027311Z T: int, 2025-05-07T20:32:42.3027507Z D: int, 2025-05-07T20:32:42.3027716Z scale_ub: Optional[float], 2025-05-07T20:32:42.3027985Z contiguous: bool, 2025-05-07T20:32:42.3028217Z compiled: bool, 2025-05-07T20:32:42.3028479Z ) -> None: 2025-05-07T20:32:42.3028692Z torch.manual_seed(2025) 2025-05-07T20:32:42.3028928Z 2025-05-07T20:32:42.3029190Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3029530Z 2025-05-07T20:32:42.3029717Z x_sign = torch.sign(x) 2025-05-07T20:32:42.3030092Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.3032095Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
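Note the trend across examples: free memory was 26.44 MiB in the earlier failures and is down to 4.44 MiB here, so allocations are accumulating within the single test process as Hypothesis iterates. One mitigation sketch, assuming no live references are intentionally kept between examples (this is not something the logged test currently does):

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()               # drop dead Python references first
        torch.cuda.empty_cache()   # then return cached blocks to the driver

Called at the start of the test body, this limits carry-over between generated examples, provided nothing else holds the tensors alive.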
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3034003Z 2025-05-07T20:32:42.3034121Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.3034337Z 2025-05-07T20:32:42.3034436Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3034849Z self=, 2025-05-07T20:32:42.3035247Z T=128, 2025-05-07T20:32:42.3035437Z D=5120, 2025-05-07T20:32:42.3035627Z scale_ub=1200.0, 2025-05-07T20:32:42.3035866Z contiguous=True, 2025-05-07T20:32:42.3036111Z compiled=True, 2025-05-07T20:32:42.3036312Z ) 2025-05-07T20:32:42.3036615Z self = 2025-05-07T20:32:42.3037099Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.3037369Z 2025-05-07T20:32:42.3037442Z @given( 2025-05-07T20:32:42.3037662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3037964Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3038269Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3038596Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3038913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3039194Z ) 2025-05-07T20:32:42.3039544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3039975Z def test_silu_mul_quant( 2025-05-07T20:32:42.3040237Z self, 2025-05-07T20:32:42.3040429Z T: int, 2025-05-07T20:32:42.3040625Z D: int, 2025-05-07T20:32:42.3040831Z scale_ub: Optional[float], 2025-05-07T20:32:42.3041100Z contiguous: bool, 2025-05-07T20:32:42.3041337Z compiled: bool, 2025-05-07T20:32:42.3041550Z ) -> None: 2025-05-07T20:32:42.3041761Z torch.manual_seed(2025) 2025-05-07T20:32:42.3041998Z 2025-05-07T20:32:42.3042259Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3042596Z 2025-05-07T20:32:42.3042792Z x_sign = torch.sign(x) 2025-05-07T20:32:42.3043072Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.3045109Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3047014Z 2025-05-07T20:32:42.3047129Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.3047344Z 2025-05-07T20:32:42.3047443Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3047863Z self=, 2025-05-07T20:32:42.3048311Z T=128, 2025-05-07T20:32:42.3048495Z D=7168, 2025-05-07T20:32:42.3048685Z scale_ub=None, 2025-05-07T20:32:42.3048890Z contiguous=True, 2025-05-07T20:32:42.3049115Z compiled=True, 2025-05-07T20:32:42.3049315Z ) 2025-05-07T20:32:42.5174942Z self = 2025-05-07T20:32:42.5175709Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.5176074Z 2025-05-07T20:32:42.5176183Z @given( 2025-05-07T20:32:42.5176493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5176903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5177301Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5177728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5178092Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5178381Z ) 2025-05-07T20:32:42.5178965Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5179426Z def test_silu_mul_quant( 2025-05-07T20:32:42.5179664Z self, 2025-05-07T20:32:42.5179862Z T: int, 2025-05-07T20:32:42.5180061Z D: int, 2025-05-07T20:32:42.5180275Z scale_ub: Optional[float], 2025-05-07T20:32:42.5180545Z contiguous: bool, 2025-05-07T20:32:42.5180789Z compiled: bool, 2025-05-07T20:32:42.5181011Z ) -> None: 2025-05-07T20:32:42.5181231Z torch.manual_seed(2025) 2025-05-07T20:32:42.5181478Z 2025-05-07T20:32:42.5181743Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5183826Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
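The reference path shown further down in this log (ref_fn, which calls triton_quantize_fp8_row and dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]) implies a per-row scale chosen so each row fits the FP8 range. A pure-PyTorch sketch of that contract (hypothetical helper; the constant and clamping are assumptions, not FBGEMM's exact kernel):

    from typing import Optional, Tuple
    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value of float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # cap the dynamic range
        scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX  # one dequant scale per row
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing as the test does, y_fp8.to(torch.float32) * scale[:, None], recovers y up to FP8 rounding.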
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5185773Z 2025-05-07T20:32:42.5185902Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.5186123Z 2025-05-07T20:32:42.5197123Z FAILED 2025-05-07T20:32:42.5197288Z 2025-05-07T20:32:42.5197638Z =================================== FAILURES =================================== 2025-05-07T20:32:42.5198240Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:42.5198848Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:42.5199689Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:42.5200425Z | yield 2025-05-07T20:32:42.5201001Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:32:42.5201716Z | self._callTestMethod(testMethod) 2025-05-07T20:32:42.5202497Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:32:42.5203219Z | method() 2025-05-07T20:32:42.5204302Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:42.5205493Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5206485Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:42.5207340Z | raise the_error_hypothesis_found 2025-05-07T20:32:42.5208003Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:42.5208667Z +-+---------------- 1 ---------------- 2025-05-07T20:32:42.5209163Z | Traceback (most recent call last): 2025-05-07T20:32:42.5210124Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.5211192Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5214471Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
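Hypothesis aggregates the distinct falsifying examples into a single exception group; on Python 3.9, which predates PEP 654, it uses the exceptiongroup backport visible in the traceback above. Of the four sub-exceptions, three are the same CUDA OOM at activation_test.py:92 and one is the Triton fp8e4nv CompilationError, so there are really only two root causes. A small, purely illustrative sketch of unpacking such a group:

    from exceptiongroup import ExceptionGroup  # backport; built in from Python 3.11

    try:
        raise ExceptionGroup("demo", [ValueError("oom"), TypeError("compile")])
    except ExceptionGroup as eg:
        for sub in eg.exceptions:  # inspect each distinct failure
            print(type(sub).__name__, sub)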
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5230669Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.5231393Z | self=, 2025-05-07T20:32:42.5231942Z | T=2048, 2025-05-07T20:32:42.5232260Z | D=5120, # or any other generated value 2025-05-07T20:32:42.5232728Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:42.5233201Z | contiguous=True, # or any other generated value 2025-05-07T20:32:42.5233704Z | compiled=False, # or any other generated value 2025-05-07T20:32:42.5234127Z | ) 2025-05-07T20:32:42.5234370Z | 2025-05-07T20:32:42.5235076Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:42.5235917Z +---------------- 2 ---------------- 2025-05-07T20:32:42.5236316Z | Traceback (most recent call last): 2025-05-07T20:32:42.5237297Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.5238360Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5241211Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5243386Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.5243839Z | self=, 2025-05-07T20:32:42.5244246Z | T=128, 2025-05-07T20:32:42.5244451Z | D=7168, 2025-05-07T20:32:42.5244668Z | scale_ub=None, 2025-05-07T20:32:42.5244909Z | contiguous=True, 2025-05-07T20:32:42.5245145Z | compiled=True, 2025-05-07T20:32:42.5245368Z | ) 2025-05-07T20:32:42.5245551Z | 2025-05-07T20:32:42.5246121Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.5246790Z +---------------- 3 ---------------- 2025-05-07T20:32:42.5247084Z | Traceback (most recent call last): 2025-05-07T20:32:42.5247861Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.5248643Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5250705Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
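Each falsifying example above comes with a replay blob; adding it as a decorator pins Hypothesis to exactly that example, which is the fastest way to reproduce one of these failures locally. A sketch using the first blob from this log (the blob is version-locked: it only replays under Hypothesis 6.131.14, and the strategies must stay identical to the original test):

    from hypothesis import Verbosity, given, reproduce_failure, settings, strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # copied verbatim from the log
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
        ...  # original test body unchanged

As the message says, the decorator is temporary: remove it once the underlying failure is fixed.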
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5252739Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.5253190Z | self=, 2025-05-07T20:32:42.5253592Z | T=128, 2025-05-07T20:32:42.5253802Z | D=5120, 2025-05-07T20:32:42.5254018Z | scale_ub=1200.0, 2025-05-07T20:32:42.5254256Z | contiguous=True, 2025-05-07T20:32:42.5254500Z | compiled=True, 2025-05-07T20:32:42.5254729Z | ) 2025-05-07T20:32:42.5254904Z | 2025-05-07T20:32:42.5255472Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.5256126Z +---------------- 4 ---------------- 2025-05-07T20:32:42.5256430Z | Traceback (most recent call last): 2025-05-07T20:32:42.5257134Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:42.5257855Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.5258513Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:42.5259213Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5260042Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:42.5260843Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5261463Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:42.5262198Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5263145Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:42.5264211Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5265280Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:42.5266387Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5267469Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:42.5268441Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5269346Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:42.5270240Z | fn() 2025-05-07T20:32:42.5271012Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:42.5271954Z | self.fn.run( 2025-05-07T20:32:42.5272727Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:42.5273517Z | kernel = self.compile( 2025-05-07T20:32:42.5274343Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:42.5275321Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5276382Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:42.5277461Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5278173Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5278659Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5279011Z | ^ 2025-05-07T20:32:42.5279649Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5280437Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.5280997Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:42.5281700Z | self=, 2025-05-07T20:32:42.5282298Z | T=1, # or any other generated value 2025-05-07T20:32:42.5282785Z | D=5120, # or any other generated value 2025-05-07T20:32:42.5283237Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:42.5283726Z | contiguous=True, # or any other generated value 2025-05-07T20:32:42.5284219Z | compiled=True, # or any other generated value 2025-05-07T20:32:42.5284621Z | ) 2025-05-07T20:32:42.5284859Z | 2025-05-07T20:32:42.5285586Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.5286444Z +------------------------------------ 2025-05-07T20:32:42.5286930Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:42.5287450Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5288016Z self=, 2025-05-07T20:32:42.5288554Z T=1, 2025-05-07T20:32:42.5288799Z D=5120, 2025-05-07T20:32:42.5289055Z scale_ub=None, 2025-05-07T20:32:42.5289333Z contiguous=True, 2025-05-07T20:32:42.5289634Z compiled=True, 2025-05-07T20:32:42.5289917Z ) 2025-05-07T20:32:42.5290352Z self = 2025-05-07T20:32:42.5291012Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.5291380Z 2025-05-07T20:32:42.5291485Z @given( 2025-05-07T20:32:42.5291800Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5292220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5292642Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5293101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5293550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5293957Z ) 2025-05-07T20:32:42.5294435Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5295035Z def test_silu_mul_quant( 2025-05-07T20:32:42.5295362Z self, 2025-05-07T20:32:42.5295621Z T: int, 2025-05-07T20:32:42.5295905Z D: int, 2025-05-07T20:32:42.5296230Z scale_ub: Optional[float], 2025-05-07T20:32:42.5296596Z contiguous: bool, 2025-05-07T20:32:42.5296922Z compiled: bool, 2025-05-07T20:32:42.5297218Z ) -> None: 2025-05-07T20:32:42.5297572Z torch.manual_seed(2025) 2025-05-07T20:32:42.5297903Z 2025-05-07T20:32:42.5298262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5298739Z 2025-05-07T20:32:42.5299003Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5299433Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5299857Z x = x_sign * x_clamp 2025-05-07T20:32:42.5300179Z x0 = x[:, :D] 2025-05-07T20:32:42.5300462Z x1 = x[:, D:] 2025-05-07T20:32:42.5300744Z 2025-05-07T20:32:42.5300995Z if contiguous: 2025-05-07T20:32:42.5301297Z x0 = x0.contiguous() 
2025-05-07T20:32:42.5301708Z x1 = x1.contiguous() 2025-05-07T20:32:42.5302044Z 2025-05-07T20:32:42.5302309Z if scale_ub is not None: 2025-05-07T20:32:42.5302681Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5303144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5303563Z ) 2025-05-07T20:32:42.5304118Z else: 2025-05-07T20:32:42.5304402Z scale_ub_tensor = None 2025-05-07T20:32:42.5304743Z 2025-05-07T20:32:42.5305050Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5305485Z op = silu_mul_quant 2025-05-07T20:32:42.5305865Z if compiled: 2025-05-07T20:32:42.5306189Z op = torch.compile(op) 2025-05-07T20:32:42.5306592Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5306979Z 2025-05-07T20:32:42.5307239Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.5307768Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.5308164Z 2025-05-07T20:32:42.5308478Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5308932Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.5309337Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.5309769Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.5310363Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5310794Z 2025-05-07T20:32:42.5311056Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.5311313Z 2025-05-07T20:32:42.5311437Z moe/activation_test.py:126: 2025-05-07T20:32:42.5311823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5312269Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5312703Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5313773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5314810Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5315571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5316569Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5317520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5318517Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5319576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5320608Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5321623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5322490Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5323311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5323994Z fn() 2025-05-07T20:32:42.5324674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5325570Z self.fn.run( 2025-05-07T20:32:42.5326238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5327039Z kernel = self.compile( 2025-05-07T20:32:42.5327784Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5328664Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5329216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5329630Z 2025-05-07T20:32:42.5329913Z self = 2025-05-07T20:32:42.5331441Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5333406Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fcc1d69d0>} 2025-05-07T20:32:42.5335296Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5336755Z context = 2025-05-07T20:32:42.5337162Z 2025-05-07T20:32:42.5337444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5338172Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5338792Z module_map=module_map) 2025-05-07T20:32:42.5339257Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5339719Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5340062Z E ^ 2025-05-07T20:32:42.5340674Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5341272Z 2025-05-07T20:32:42.5341820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5342503Z 2025-05-07T20:32:42.5342635Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5343172Z self=, 2025-05-07T20:32:42.5343700Z T=2048, 2025-05-07T20:32:42.5343939Z D=5120, 2025-05-07T20:32:42.5344191Z scale_ub=1200.0, 2025-05-07T20:32:42.5344488Z contiguous=True, 2025-05-07T20:32:42.5344766Z compiled=False, 2025-05-07T20:32:42.5345029Z ) 2025-05-07T20:32:42.5345436Z self = 2025-05-07T20:32:42.5346122Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.5346487Z 2025-05-07T20:32:42.5346587Z @given( 2025-05-07T20:32:42.5346884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5347307Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5347722Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5348158Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5348586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5348970Z ) 2025-05-07T20:32:42.5349436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5350127Z def test_silu_mul_quant( 2025-05-07T20:32:42.5350450Z self, 2025-05-07T20:32:42.5350718Z T: int, 2025-05-07T20:32:42.5350977Z D: int, 2025-05-07T20:32:42.5351266Z scale_ub: Optional[float], 2025-05-07T20:32:42.5351638Z contiguous: bool, 2025-05-07T20:32:42.5352072Z compiled: bool, 2025-05-07T20:32:42.5352373Z ) -> None: 2025-05-07T20:32:42.5352671Z torch.manual_seed(2025) 2025-05-07T20:32:42.5353004Z 2025-05-07T20:32:42.5353369Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5353899Z 2025-05-07T20:32:42.5354167Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5354564Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5354985Z x = x_sign * x_clamp 2025-05-07T20:32:42.5355313Z x0 = x[:, :D] 
2025-05-07T20:32:42.5355588Z         x1 = x[:, D:]
2025-05-07T20:32:42.5355946Z 
2025-05-07T20:32:42.5356204Z         if contiguous:
2025-05-07T20:32:42.5356502Z             x0 = x0.contiguous()
2025-05-07T20:32:42.5356829Z             x1 = x1.contiguous()
2025-05-07T20:32:42.5357141Z 
2025-05-07T20:32:42.5357388Z         if scale_ub is not None:
2025-05-07T20:32:42.5357737Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:42.5358174Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:42.5358574Z             )
2025-05-07T20:32:42.5358811Z         else:
2025-05-07T20:32:42.5359080Z             scale_ub_tensor = None
2025-05-07T20:32:42.5359410Z 
2025-05-07T20:32:42.5359701Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:42.5360109Z             op = silu_mul_quant
2025-05-07T20:32:42.5360435Z             if compiled:
2025-05-07T20:32:42.5360742Z                 op = torch.compile(op)
2025-05-07T20:32:42.5361134Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.5361488Z 
2025-05-07T20:32:42.5361787Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:42.5362012Z 
2025-05-07T20:32:42.5362138Z moe/activation_test.py:117: 
2025-05-07T20:32:42.5362526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:42.5362959Z moe/activation_test.py:115: in fn
2025-05-07T20:32:42.5363319Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.5364257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:42.5365219Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:42.5365992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:42.5366934Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:42.5367826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:42.5368553Z     kernel = self.compile(
2025-05-07T20:32:42.5369290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:42.5370194Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:42.5370744Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:42.5371065Z 
2025-05-07T20:32:42.5371356Z self = <...>
2025-05-07T20:32:42.5372860Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:42.5374785Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f9fb83b05e0>}
2025-05-07T20:32:42.5376619Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:42.5378016Z context = <...>
2025-05-07T20:32:42.5378392Z 
2025-05-07T20:32:42.5378616Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:42.5379386Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:42.5380031Z                            module_map=module_map)
2025-05-07T20:32:42.5380575Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:42.5381050Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:42.5381401Z E       ^
2025-05-07T20:32:42.5382028Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:42.5382721Z 
2025-05-07T20:32:42.5390038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:42.5390883Z 
2025-05-07T20:32:42.5391044Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:42.5391627Z     self=<...>,
2025-05-07T20:32:42.5392186Z     T=2048,
2025-05-07T20:32:42.5392443Z     D=5120,
2025-05-07T20:32:42.5392689Z     scale_ub=1200.0,
2025-05-07T20:32:42.5392967Z     contiguous=True,
2025-05-07T20:32:42.5393256Z     compiled=True,
2025-05-07T20:32:42.5393512Z )
2025-05-07T20:32:42.5393930Z self = <...>
2025-05-07T20:32:42.5394607Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:42.5394984Z 
2025-05-07T20:32:42.5395088Z     @given(
2025-05-07T20:32:42.5395391Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:42.5396041Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:42.5396460Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:42.5396907Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:42.5397349Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:42.5397731Z     )
2025-05-07T20:32:42.5398205Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:42.5398821Z     def test_silu_mul_quant(
2025-05-07T20:32:42.5399145Z         self,
2025-05-07T20:32:42.5399411Z         T: int,
2025-05-07T20:32:42.5399680Z         D: int,
2025-05-07T20:32:42.5399969Z         scale_ub: Optional[float],
2025-05-07T20:32:42.5400351Z         contiguous: bool,
2025-05-07T20:32:42.5400678Z         compiled: bool,
2025-05-07T20:32:42.5400982Z     ) -> None:
2025-05-07T20:32:42.5401281Z         torch.manual_seed(2025)
2025-05-07T20:32:42.5401547Z 
2025-05-07T20:32:42.5401820Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:42.5402164Z 
2025-05-07T20:32:42.5402352Z         x_sign = torch.sign(x)
2025-05-07T20:32:42.5402644Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.5402946Z         x = x_sign * x_clamp
2025-05-07T20:32:42.5403186Z         x0 = x[:, :D]
2025-05-07T20:32:42.5403402Z         x1 = x[:, D:]
2025-05-07T20:32:42.5403604Z 
2025-05-07T20:32:42.5404107Z         if contiguous:
2025-05-07T20:32:42.5404348Z             x0 = x0.contiguous()
2025-05-07T20:32:42.5404601Z             x1 = x1.contiguous()
2025-05-07T20:32:42.5404841Z 
2025-05-07T20:32:42.5405029Z         if scale_ub is not None:
2025-05-07T20:32:42.5405295Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:42.5405635Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:42.5405945Z             )
2025-05-07T20:32:42.5406131Z         else:
2025-05-07T20:32:42.5406340Z             scale_ub_tensor = None
2025-05-07T20:32:42.5406590Z 
2025-05-07T20:32:42.5406821Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:42.5407129Z             op = silu_mul_quant
2025-05-07T20:32:42.5407378Z             if compiled:
2025-05-07T20:32:42.5407624Z                 op = torch.compile(op)
2025-05-07T20:32:42.5407915Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.5408189Z 
2025-05-07T20:32:42.5408376Z         y_fp8, y_scale = fn()
2025-05-07T20:32:42.5408834Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:42.5409127Z 
2025-05-07T20:32:42.5409360Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:42.5409684Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:42.5410067Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:42.5410382Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:42.5410739Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:42.5411056Z 
2025-05-07T20:32:42.5411253Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:42.5411529Z 
2025-05-07T20:32:42.5411633Z moe/activation_test.py:126: 
2025-05-07T20:32:42.5411925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:42.5412259Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:42.5412587Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:42.5413371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:42.5414127Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:42.5414677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:42.5415365Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:42.5416092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:42.5416881Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:42.5417628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:42.5418369Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:42.5419092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:42.5419725Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:42.5420322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:42.5420833Z     fn()
2025-05-07T20:32:42.5421332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:42.5421907Z     self.fn.run(
2025-05-07T20:32:42.5422371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:42.5422894Z     kernel = self.compile(
2025-05-07T20:32:42.5423427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:42.5424075Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:42.5424468Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:42.5424700Z 
2025-05-07T20:32:42.5424905Z self = <...>
2025-05-07T20:32:42.5425997Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:42.5427402Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f9fcac53550>}
2025-05-07T20:32:42.5428753Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:42.5429773Z context = <...>
2025-05-07T20:32:42.5430201Z 
2025-05-07T20:32:42.5430364Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:42.5430884Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:42.5442542Z                            module_map=module_map)
2025-05-07T20:32:42.5442923Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:42.5443283Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:42.5443549Z E       ^
2025-05-07T20:32:42.5444016Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:42.5444516Z 
2025-05-07T20:32:42.5444937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
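
Both failing kernels share one root cause: fp8e4nv is Triton's name for the float8_e4m3fn format, and Triton's NVIDIA backend accepts it only on compute capability 8.9 or newer (Ada/Hopper). The A10G in a linux.g5.4xlarge runner reports capability 8.6, where only 'fp8e4b15' and 'fp8e5' exist, so every Hypothesis example below fails identically at compile time. A minimal sketch of a capability gate such a test could use to skip these cases on unsupported GPUs (hypothetical helper, not FBGEMM's actual guard):

# Hedged sketch (not part of FBGEMM): skip FP8 e4m3 tests on pre-sm_89 GPUs,
# where Triton rejects fp8e4nv with the ValueError shown above.
import unittest

import torch

def supports_fp8e4nv() -> bool:
    """True if Triton's fp8e4nv (torch.float8_e4m3fn) compiles on this GPU."""
    if not torch.cuda.is_available():
        return False
    # Ada/Hopper and newer report (8, 9) or higher; the A10G reports (8, 6).
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical usage on the test above:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs sm_89+")
# def test_silu_mul_quant(self, ...) -> None: ...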
2025-05-07T20:32:42.5445551Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[... same test body as the T=2048 listing above; fn() fails compiling _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError ...]
2025-05-07T20:32:42.5476285Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
[... same test body; ref_fn() fails compiling _kernel_quantize_fp8_row with the same fp8e4nv CompilationError ...]
2025-05-07T20:32:42.5516606Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... same test body; fn() fails compiling _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError ...]
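
For reference, both kernels implement the same computation the test's ref_fn spells out: SiLU(x0) * x1 followed by row-wise FP8 quantization. A Triton-free PyTorch sketch of that math (the row-wise scaling scheme below is an illustrative assumption, not FBGEMM's exact kernel contract; requires torch >= 2.1 for torch.float8_e4m3fn):

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, mirroring ref_fn in the listing above.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # One scale per row; optionally cap the row max with scale_ub.
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / FP8_MAX
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    # Dequantize as the test does: y ~= y_fp8.to(torch.float32) * scale[:, None]
    return y_fp8, scale.squeeze(-1)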
2025-05-07T20:32:42.5547560Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
[... same test body; fn() fails compiling _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError ...]
2025-05-07T20:32:42.5577842Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... same test body; ref_fn() fails compiling _kernel_quantize_fp8_row with the same fp8e4nv CompilationError ...]
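
The ValueError itself names the only FP8 formats this architecture accepts: 'fp8e4b15' and 'fp8e5' (float8_e5m2). A hedged fallback sketch would pick the dtype by capability; this is illustrative only, since FBGEMM's kernels are written against e4m3 and would need matching changes for comparable numerics:

import torch

def pick_fp8_dtype() -> torch.dtype:
    # e4m3 ('fp8e4nv') needs sm_89+; e5m2 ('fp8e5') also works on sm_80/sm_86,
    # trading two mantissa bits for compatibility.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2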
y_scale_ref = ref_fn() 2025-05-07T20:32:42.5601973Z 2025-05-07T20:32:42.5602068Z moe/activation_test.py:126: 2025-05-07T20:32:42.5602363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5602764Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5603090Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5604251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5605012Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5605555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5606282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5607029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5607738Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5608488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5609231Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5609953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5610582Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5611180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5611690Z fn() 2025-05-07T20:32:42.5612256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5612834Z self.fn.run( 2025-05-07T20:32:42.5613286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5613812Z kernel = self.compile( 2025-05-07T20:32:42.5614343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5614986Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5615383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5615609Z 2025-05-07T20:32:42.5615833Z self = 2025-05-07T20:32:42.5616949Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5618322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9fca2f73a0>} 2025-05-07T20:32:42.5619666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5620695Z context = 2025-05-07T20:32:42.5620981Z 2025-05-07T20:32:42.5621151Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5621674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5622134Z module_map=module_map) 2025-05-07T20:32:42.5622495Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5622848Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5623099Z E ^ 2025-05-07T20:32:42.5623550Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5623998Z 2025-05-07T20:32:42.5624413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5624992Z 2025-05-07T20:32:42.5625094Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5625496Z self=, 2025-05-07T20:32:42.5625953Z T=128, 2025-05-07T20:32:42.5626160Z D=7168, 2025-05-07T20:32:42.5626337Z scale_ub=None, 2025-05-07T20:32:42.5626548Z contiguous=False, 2025-05-07T20:32:42.5626768Z compiled=False, 2025-05-07T20:32:42.5626959Z ) 2025-05-07T20:32:42.5627267Z self = 2025-05-07T20:32:42.5627795Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.5628060Z 2025-05-07T20:32:42.5628137Z @given( 2025-05-07T20:32:42.5628351Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5628656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5628953Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5629277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5629599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5629936Z ) 2025-05-07T20:32:42.5630277Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5630709Z def test_silu_mul_quant( 2025-05-07T20:32:42.5630940Z self, 2025-05-07T20:32:42.5631119Z T: int, 2025-05-07T20:32:42.5631306Z D: int, 2025-05-07T20:32:42.5631518Z scale_ub: Optional[float], 2025-05-07T20:32:42.5631772Z contiguous: bool, 2025-05-07T20:32:42.5632056Z compiled: bool, 2025-05-07T20:32:42.5632271Z ) -> None: 2025-05-07T20:32:42.5632475Z torch.manual_seed(2025) 2025-05-07T20:32:42.5632705Z 2025-05-07T20:32:42.5632968Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5633301Z 2025-05-07T20:32:42.5633479Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5633762Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5634064Z x = x_sign * x_clamp 2025-05-07T20:32:42.5634293Z x0 = x[:, :D] 2025-05-07T20:32:42.5634499Z x1 = x[:, D:] 2025-05-07T20:32:42.5634698Z 2025-05-07T20:32:42.5634870Z if contiguous: 2025-05-07T20:32:42.5635088Z x0 = x0.contiguous() 2025-05-07T20:32:42.5635338Z x1 = x1.contiguous() 2025-05-07T20:32:42.5635568Z 2025-05-07T20:32:42.5635773Z if scale_ub is not None: 2025-05-07T20:32:42.5636058Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5636387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5636686Z ) 2025-05-07T20:32:42.5636868Z else: 2025-05-07T20:32:42.5637063Z scale_ub_tensor = None 2025-05-07T20:32:42.5637306Z 2025-05-07T20:32:42.5637526Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5637833Z op = silu_mul_quant 2025-05-07T20:32:42.5638073Z if compiled: 
2025-05-07T20:32:42.5638310Z op = torch.compile(op) 2025-05-07T20:32:42.5638599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5638864Z 2025-05-07T20:32:42.5639051Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5639212Z 2025-05-07T20:32:42.5639311Z moe/activation_test.py:117: 2025-05-07T20:32:42.5639591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5639915Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5640185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5640872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5641555Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5642087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5642822Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5643468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5643988Z kernel = self.compile( 2025-05-07T20:32:42.5644563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5645207Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5645595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5645904Z 2025-05-07T20:32:42.5646106Z self = 2025-05-07T20:32:42.5647185Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5648564Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fca2dfd30>} 2025-05-07T20:32:42.5649910Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5650931Z context = 2025-05-07T20:32:42.5651224Z 2025-05-07T20:32:42.5651459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5651980Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5652432Z module_map=module_map) 2025-05-07T20:32:42.5652790Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5653136Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5653385Z E ^ 2025-05-07T20:32:42.5653837Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5654288Z 2025-05-07T20:32:42.5654702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5655209Z 2025-05-07T20:32:42.5655315Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5655714Z self=, 2025-05-07T20:32:42.5656152Z T=4096, 2025-05-07T20:32:42.5656335Z D=5120, 2025-05-07T20:32:42.5656512Z scale_ub=1200.0, 2025-05-07T20:32:42.5656719Z contiguous=True, 2025-05-07T20:32:42.5656936Z compiled=False, 2025-05-07T20:32:42.5657134Z ) 2025-05-07T20:32:42.5657437Z self = 2025-05-07T20:32:42.5657922Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.5658193Z 2025-05-07T20:32:42.5658270Z @given( 2025-05-07T20:32:42.5658485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5658790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5659090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5659403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5659723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5660002Z ) 2025-05-07T20:32:42.5660343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5660771Z def test_silu_mul_quant( 2025-05-07T20:32:42.5661006Z self, 2025-05-07T20:32:42.5661187Z T: int, 2025-05-07T20:32:42.5661370Z D: int, 2025-05-07T20:32:42.5661576Z scale_ub: Optional[float], 2025-05-07T20:32:42.5661834Z contiguous: bool, 2025-05-07T20:32:42.5662057Z compiled: bool, 2025-05-07T20:32:42.5662320Z ) -> None: 2025-05-07T20:32:42.5662522Z torch.manual_seed(2025) 2025-05-07T20:32:42.5662757Z 2025-05-07T20:32:42.5663018Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5663347Z 2025-05-07T20:32:42.5663568Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5663847Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5664146Z x = x_sign * x_clamp 2025-05-07T20:32:42.5664380Z x0 = x[:, :D] 2025-05-07T20:32:42.5664581Z x1 = x[:, D:] 2025-05-07T20:32:42.5664775Z 2025-05-07T20:32:42.5664992Z if contiguous: 2025-05-07T20:32:42.5665211Z x0 = x0.contiguous() 2025-05-07T20:32:42.5665454Z x1 = x1.contiguous() 2025-05-07T20:32:42.5665684Z 2025-05-07T20:32:42.5665862Z if scale_ub is not None: 2025-05-07T20:32:42.5666118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5666440Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5666742Z ) 2025-05-07T20:32:42.5666915Z else: 2025-05-07T20:32:42.5667114Z scale_ub_tensor = None 2025-05-07T20:32:42.5667354Z 2025-05-07T20:32:42.5667570Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5667877Z op = silu_mul_quant 2025-05-07T20:32:42.5668113Z if compiled: 2025-05-07T20:32:42.5668343Z op = torch.compile(op) 2025-05-07T20:32:42.5668629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5668890Z 2025-05-07T20:32:42.5669110Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5669284Z 2025-05-07T20:32:42.5669378Z moe/activation_test.py:117: 2025-05-07T20:32:42.5669659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5670021Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5670287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5670963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5671645Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5672171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5672838Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5673488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5674005Z kernel = self.compile( 2025-05-07T20:32:42.5674384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5674554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5674676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5674681Z 2025-05-07T20:32:42.5674884Z self = 2025-05-07T20:32:42.5675665Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5676220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fca3121f0>} 2025-05-07T20:32:42.5676969Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5677157Z context = 2025-05-07T20:32:42.5677162Z 2025-05-07T20:32:42.5677322Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5677633Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5677734Z module_map=module_map) 2025-05-07T20:32:42.5677931Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5678031Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5678101Z E ^ 2025-05-07T20:32:42.5678468Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5678474Z 2025-05-07T20:32:42.5678925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5678930Z 2025-05-07T20:32:42.5679026Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5679247Z self=, 2025-05-07T20:32:42.5679317Z T=1, 2025-05-07T20:32:42.5679390Z D=5120, 2025-05-07T20:32:42.5679473Z scale_ub=None, 2025-05-07T20:32:42.5679552Z contiguous=True, 2025-05-07T20:32:42.5679632Z compiled=True, 2025-05-07T20:32:42.5679699Z ) 2025-05-07T20:32:42.5679917Z self = 2025-05-07T20:32:42.5680082Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.5680087Z 2025-05-07T20:32:42.5680161Z @given( 2025-05-07T20:32:42.5680275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5680372Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5680526Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5680644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5680751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5680819Z ) 2025-05-07T20:32:42.5681064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5681157Z def test_silu_mul_quant( 2025-05-07T20:32:42.5681229Z self, 2025-05-07T20:32:42.5681307Z T: int, 2025-05-07T20:32:42.5681379Z D: int, 2025-05-07T20:32:42.5681474Z scale_ub: Optional[float], 2025-05-07T20:32:42.5681561Z contiguous: bool, 2025-05-07T20:32:42.5681644Z compiled: bool, 2025-05-07T20:32:42.5681714Z ) -> None: 2025-05-07T20:32:42.5681807Z torch.manual_seed(2025) 2025-05-07T20:32:42.5681874Z 2025-05-07T20:32:42.5682036Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5682107Z 2025-05-07T20:32:42.5682199Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5682323Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5682407Z x = x_sign * x_clamp 2025-05-07T20:32:42.5682480Z x0 = x[:, :D] 2025-05-07T20:32:42.5682555Z x1 = x[:, D:] 2025-05-07T20:32:42.5682624Z 2025-05-07T20:32:42.5682703Z if contiguous: 2025-05-07T20:32:42.5682797Z x0 = x0.contiguous() 2025-05-07T20:32:42.5682881Z x1 = x1.contiguous() 2025-05-07T20:32:42.5682948Z 2025-05-07T20:32:42.5683038Z if scale_ub is not None: 2025-05-07T20:32:42.5683137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5683273Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5683346Z ) 2025-05-07T20:32:42.5683419Z else: 2025-05-07T20:32:42.5683509Z scale_ub_tensor = None 2025-05-07T20:32:42.5683575Z 2025-05-07T20:32:42.5683698Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5683789Z op = silu_mul_quant 2025-05-07T20:32:42.5683868Z if compiled: 2025-05-07T20:32:42.5683963Z op = torch.compile(op) 2025-05-07T20:32:42.5684065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5684131Z 2025-05-07T20:32:42.5684215Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.5684335Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.5684450Z 2025-05-07T20:32:42.5684582Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5684681Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.5684815Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.5684935Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.5685073Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5685141Z 2025-05-07T20:32:42.5685237Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:42.5685242Z 2025-05-07T20:32:42.5685379Z moe/activation_test.py:126: 2025-05-07T20:32:42.5685503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5685606Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5685735Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5686318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5686430Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5686812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5687034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5687394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5687688Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5688090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5688339Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5688708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5688872Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5689211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5689286Z fn() 2025-05-07T20:32:42.5689679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5689759Z self.fn.run( 2025-05-07T20:32:42.5690091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5690181Z kernel = self.compile( 2025-05-07T20:32:42.5690560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5690730Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5690850Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5690856Z 2025-05-07T20:32:42.5691062Z self = 2025-05-07T20:32:42.5691840Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5692348Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9fca3124c0>}
module_map = {'triton.language.extra.libdevice': <module 'triton.language.extra.libdevice' from '...'>}
context = <triton._C.libtriton.ir.context object at 0x...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ... at 0x...>, 'min_dot_size': <function min_dot_size.<locals>.<lambda> at 0x7f9fca00cf70>}
module_map = {'triton.language.extra.libdevice': <module 'triton.language.extra.libdevice' from '...'>}
context = <triton._C.libtriton.ir.context object at 0x...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
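Every one of these failures is the same architecture gate in Triton rather than a numerical bug in the test: the fp8e4nv type (torch.float8_e4m3fn) only compiles on NVIDIA GPUs with compute capability 8.9 or newer, and the supported-dtype list in the error, ('fp8e4b15', 'fp8e5'), is what Triton reports for pre-sm_89 parts such as the sm_86 A10G. A minimal sketch of a capability guard that a test like this could use to skip cleanly on such devices follows; the helper name and the skipUnless wiring are illustrative, not the actual FBGEMM test code.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (torch.float8_e4m3fn) lowering needs an NVIDIA GPU
    # with compute capability >= 8.9; sm_86 parts such as the A10G only
    # support the 'fp8e4b15' and 'fp8e5' encodings.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the test above:
#
# @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires an sm_89+ GPU")
# def test_silu_mul_quant(self, ...) -> None:
#     ...

The error should also be reproducible outside FBGEMM by any Triton kernel that casts to fp8e4nv on such a device; a self-contained sketch under the same assumptions (kernel name, sizes, and grid are illustrative):

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    # The store through a float8_e4m3fn pointer forces the fp8e4nv
    # conversion that Triton rejects at compile time on pre-sm_89 GPUs.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.float32)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# On sm_86 this raises the same CompilationError as in the log above;
# on sm_89+ it completes normally.
_cast_to_fp8e4nv[(1,)](x, y, 1024, BLOCK=1024)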
2025-05-07T20:32:42.5716512Z op = silu_mul_quant 2025-05-07T20:32:42.5716590Z if compiled: 2025-05-07T20:32:42.5716688Z op = torch.compile(op) 2025-05-07T20:32:42.5716837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5716905Z 2025-05-07T20:32:42.5716994Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.5717109Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.5717179Z 2025-05-07T20:32:42.5717350Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5717448Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.5717540Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.5717661Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.5717836Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5717906Z 2025-05-07T20:32:42.5718001Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.5718005Z 2025-05-07T20:32:42.5718095Z moe/activation_test.py:126: 2025-05-07T20:32:42.5718220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5718321Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5718452Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5719014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5719109Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5719469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5719683Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5720081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5720337Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5720727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5720981Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5721349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5721513Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5721849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5721918Z fn() 2025-05-07T20:32:42.5722312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5722394Z self.fn.run( 2025-05-07T20:32:42.5722724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5722815Z kernel = self.compile( 2025-05-07T20:32:42.5723187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5723361Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5723485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5723489Z 2025-05-07T20:32:42.5723696Z self = 2025-05-07T20:32:42.5724481Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5724990Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc9e93b80>} 2025-05-07T20:32:42.5725727Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5725961Z context = 2025-05-07T20:32:42.5725965Z 2025-05-07T20:32:42.5726190Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5726455Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5726558Z module_map=module_map) 2025-05-07T20:32:42.5726717Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5726862Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5726936Z E ^ 2025-05-07T20:32:42.5727286Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5727294Z 2025-05-07T20:32:42.5727701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5727708Z 2025-05-07T20:32:42.5727808Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5728033Z self=, 2025-05-07T20:32:42.5728103Z T=4096, 2025-05-07T20:32:42.5728178Z D=5120, 2025-05-07T20:32:42.5728255Z scale_ub=None, 2025-05-07T20:32:42.5728334Z contiguous=True, 2025-05-07T20:32:42.5728416Z compiled=True, 2025-05-07T20:32:42.5728484Z ) 2025-05-07T20:32:42.5728700Z self = 2025-05-07T20:32:42.5728909Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.5728914Z 2025-05-07T20:32:42.5728988Z @given( 2025-05-07T20:32:42.5729102Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5729199Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5729307Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5729425Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5729626Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5729809Z ) 2025-05-07T20:32:42.5736028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5736165Z def test_silu_mul_quant( 2025-05-07T20:32:42.5736244Z self, 2025-05-07T20:32:42.5736338Z T: int, 2025-05-07T20:32:42.5736408Z D: int, 2025-05-07T20:32:42.5736502Z scale_ub: Optional[float], 2025-05-07T20:32:42.5736586Z contiguous: bool, 2025-05-07T20:32:42.5736672Z compiled: bool, 2025-05-07T20:32:42.5736749Z ) -> None: 2025-05-07T20:32:42.5736837Z torch.manual_seed(2025) 2025-05-07T20:32:42.5736905Z 2025-05-07T20:32:42.5737076Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5737145Z 2025-05-07T20:32:42.5737232Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5737359Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5737445Z x = x_sign * x_clamp 2025-05-07T20:32:42.5737520Z x0 = x[:, :D] 2025-05-07T20:32:42.5737597Z x1 = x[:, D:] 2025-05-07T20:32:42.5737664Z 2025-05-07T20:32:42.5737746Z if contiguous: 2025-05-07T20:32:42.5737835Z x0 = x0.contiguous() 2025-05-07T20:32:42.5737920Z x1 = x1.contiguous() 2025-05-07T20:32:42.5737990Z 2025-05-07T20:32:42.5738075Z if scale_ub is not None: 2025-05-07T20:32:42.5738172Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5738309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5738381Z ) 2025-05-07T20:32:42.5738454Z else: 2025-05-07T20:32:42.5738545Z scale_ub_tensor 
= None 2025-05-07T20:32:42.5738612Z 2025-05-07T20:32:42.5738737Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5738825Z op = silu_mul_quant 2025-05-07T20:32:42.5738970Z if compiled: 2025-05-07T20:32:42.5739069Z op = torch.compile(op) 2025-05-07T20:32:42.5739168Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5739236Z 2025-05-07T20:32:42.5739323Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.5739484Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.5739552Z 2025-05-07T20:32:42.5739688Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5739783Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.5739875Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.5740042Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.5740178Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5740249Z 2025-05-07T20:32:42.5740348Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.5740353Z 2025-05-07T20:32:42.5740450Z moe/activation_test.py:126: 2025-05-07T20:32:42.5740583Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5740683Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5740813Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5741376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5741472Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5741829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5742093Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5742456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5742710Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5743103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5743350Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5743722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5743885Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5744225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5744306Z fn() 2025-05-07T20:32:42.5744700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5744781Z self.fn.run( 2025-05-07T20:32:42.5745109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5745199Z kernel = self.compile( 2025-05-07T20:32:42.5745577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5745749Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5745878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5745883Z 2025-05-07T20:32:42.5746085Z self = 2025-05-07T20:32:42.5746864Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5747376Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc99e0c10>} 2025-05-07T20:32:42.5748113Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5748352Z context = 2025-05-07T20:32:42.5748395Z 2025-05-07T20:32:42.5748555Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5748816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5748917Z module_map=module_map) 2025-05-07T20:32:42.5749118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5749216Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5749288Z E ^ 2025-05-07T20:32:42.5749637Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5749642Z 2025-05-07T20:32:42.5750135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5750140Z 2025-05-07T20:32:42.5750237Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5750460Z self=, 2025-05-07T20:32:42.5750533Z T=16384, 2025-05-07T20:32:42.5750601Z D=5120, 2025-05-07T20:32:42.5750679Z scale_ub=None, 2025-05-07T20:32:42.5750759Z contiguous=True, 2025-05-07T20:32:42.5750837Z compiled=True, 2025-05-07T20:32:42.5750911Z ) 2025-05-07T20:32:42.5751194Z self = 2025-05-07T20:32:42.5751366Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.5751371Z 2025-05-07T20:32:42.5751447Z @given( 2025-05-07T20:32:42.5751560Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5751656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5751773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5751883Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5751993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5752064Z ) 2025-05-07T20:32:42.5752305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5752397Z def test_silu_mul_quant( 2025-05-07T20:32:42.5752469Z self, 2025-05-07T20:32:42.5752540Z T: int, 2025-05-07T20:32:42.5752614Z D: int, 2025-05-07T20:32:42.5752711Z scale_ub: Optional[float], 2025-05-07T20:32:42.5752798Z contiguous: bool, 2025-05-07T20:32:42.5752883Z compiled: bool, 2025-05-07T20:32:42.5752957Z ) -> None: 2025-05-07T20:32:42.5753046Z torch.manual_seed(2025) 2025-05-07T20:32:42.5753115Z 2025-05-07T20:32:42.5753278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5753350Z 2025-05-07T20:32:42.5753444Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5753565Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5753651Z x = x_sign * x_clamp 2025-05-07T20:32:42.5753725Z x0 = x[:, :D] 2025-05-07T20:32:42.5753800Z x1 = x[:, D:] 2025-05-07T20:32:42.5753873Z 2025-05-07T20:32:42.5753950Z if contiguous: 2025-05-07T20:32:42.5754037Z x0 = x0.contiguous() 2025-05-07T20:32:42.5754124Z x1 = x1.contiguous() 2025-05-07T20:32:42.5754190Z 2025-05-07T20:32:42.5754274Z if scale_ub is not None: 2025-05-07T20:32:42.5754381Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5754512Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:42.5754585Z ) 2025-05-07T20:32:42.5754659Z else: 2025-05-07T20:32:42.5754750Z scale_ub_tensor = None 2025-05-07T20:32:42.5754821Z 2025-05-07T20:32:42.5754949Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5755083Z op = silu_mul_quant 2025-05-07T20:32:42.5755165Z if compiled: 2025-05-07T20:32:42.5755260Z op = torch.compile(op) 2025-05-07T20:32:42.5755360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5755469Z 2025-05-07T20:32:42.5755557Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.5755676Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.5755749Z 2025-05-07T20:32:42.5755880Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5755978Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.5756118Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.5756237Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.5756375Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5756443Z 2025-05-07T20:32:42.5756537Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.5756544Z 2025-05-07T20:32:42.5756640Z moe/activation_test.py:126: 2025-05-07T20:32:42.5756764Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5756863Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5756998Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5757552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5757648Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5758044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5758266Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5758626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5758878Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5759274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5759525Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5759893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5760058Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5760402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5760471Z fn() 2025-05-07T20:32:42.5760870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5760950Z self.fn.run( 2025-05-07T20:32:42.5761282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5761376Z kernel = self.compile( 2025-05-07T20:32:42.5761750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5761930Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5762050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:42.5762054Z 2025-05-07T20:32:42.5762257Z self = 2025-05-07T20:32:42.5763042Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5763547Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc99b0c10>} 2025-05-07T20:32:42.5764370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5764557Z context = 2025-05-07T20:32:42.5764562Z 2025-05-07T20:32:42.5764726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5764984Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5765125Z module_map=module_map) 2025-05-07T20:32:42.5765286Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5765380Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5765450Z E ^ 2025-05-07T20:32:42.5765803Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5765810Z 2025-05-07T20:32:42.5766242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5766247Z 2025-05-07T20:32:42.5766362Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5766588Z self=, 2025-05-07T20:32:42.5766660Z T=1, 2025-05-07T20:32:42.5766733Z D=5120, 2025-05-07T20:32:42.5766810Z scale_ub=1200.0, 2025-05-07T20:32:42.5766889Z contiguous=True, 2025-05-07T20:32:42.5767013Z compiled=True, 2025-05-07T20:32:42.5767084Z ) 2025-05-07T20:32:42.5767306Z self = 2025-05-07T20:32:42.5767466Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.5767471Z 2025-05-07T20:32:42.5767541Z @given( 2025-05-07T20:32:42.5767660Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5767758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5767868Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5767982Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5768098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5768165Z ) 2025-05-07T20:32:42.5768409Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5768499Z def test_silu_mul_quant( 2025-05-07T20:32:42.5768569Z self, 2025-05-07T20:32:42.5768641Z T: int, 2025-05-07T20:32:42.5768714Z D: int, 2025-05-07T20:32:42.5768811Z scale_ub: Optional[float], 2025-05-07T20:32:42.5768897Z contiguous: bool, 2025-05-07T20:32:42.5768976Z compiled: bool, 2025-05-07T20:32:42.5769053Z ) -> None: 2025-05-07T20:32:42.5769141Z torch.manual_seed(2025) 2025-05-07T20:32:42.5769208Z 2025-05-07T20:32:42.5769375Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5769444Z 2025-05-07T20:32:42.5769529Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5769651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5769738Z x = x_sign * x_clamp 2025-05-07T20:32:42.5769815Z x0 = x[:, :D] 2025-05-07T20:32:42.5769887Z x1 = x[:, D:] 2025-05-07T20:32:42.5769954Z 2025-05-07T20:32:42.5770033Z if contiguous: 2025-05-07T20:32:42.5770120Z x0 = x0.contiguous() 2025-05-07T20:32:42.5770205Z x1 = x1.contiguous() 2025-05-07T20:32:42.5770282Z 2025-05-07T20:32:42.5770367Z if scale_ub is not None: 2025-05-07T20:32:42.5770464Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:42.5770596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5770668Z ) 2025-05-07T20:32:42.5770736Z else: 2025-05-07T20:32:42.5770825Z scale_ub_tensor = None 2025-05-07T20:32:42.5770943Z 2025-05-07T20:32:42.5771069Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5771152Z op = silu_mul_quant 2025-05-07T20:32:42.5771229Z if compiled: 2025-05-07T20:32:42.5771367Z op = torch.compile(op) 2025-05-07T20:32:42.5771472Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5771535Z 2025-05-07T20:32:42.5771625Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5771629Z 2025-05-07T20:32:42.5771721Z moe/activation_test.py:117: 2025-05-07T20:32:42.5771848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5771984Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5772079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5772447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5772533Z return fn(*args, **kwargs) 2025-05-07T20:32:42.5773024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5773124Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5773475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5773693Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5774028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5774161Z kernel = self.compile( 2025-05-07T20:32:42.5774542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5774710Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5774829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5774837Z 2025-05-07T20:32:42.5775040Z self = 2025-05-07T20:32:42.5775819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5776325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc9256670>} 2025-05-07T20:32:42.5777066Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5777253Z context = 2025-05-07T20:32:42.5777261Z 2025-05-07T20:32:42.5777418Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5777676Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5777783Z module_map=module_map) 2025-05-07T20:32:42.5777943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5778035Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5778108Z E ^ 2025-05-07T20:32:42.5778458Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5778466Z 2025-05-07T20:32:42.5778878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5778882Z 2025-05-07T20:32:42.5778979Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5779194Z self=, 2025-05-07T20:32:42.5779268Z T=1, 2025-05-07T20:32:42.5779385Z D=5120, 2025-05-07T20:32:42.5779462Z scale_ub=None, 2025-05-07T20:32:42.5779546Z contiguous=False, 2025-05-07T20:32:42.5779623Z compiled=True, 2025-05-07T20:32:42.5779692Z ) 2025-05-07T20:32:42.5779946Z self = 2025-05-07T20:32:42.5780107Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.5780111Z 2025-05-07T20:32:42.5780186Z @given( 2025-05-07T20:32:42.5780299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5780394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5780573Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5780686Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5780794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5780868Z ) 2025-05-07T20:32:42.5781108Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5781202Z def test_silu_mul_quant( 2025-05-07T20:32:42.5781273Z self, 2025-05-07T20:32:42.5781346Z T: int, 2025-05-07T20:32:42.5781421Z D: int, 2025-05-07T20:32:42.5781517Z scale_ub: Optional[float], 2025-05-07T20:32:42.5781602Z contiguous: bool, 2025-05-07T20:32:42.5781687Z compiled: bool, 2025-05-07T20:32:42.5781762Z ) -> None: 2025-05-07T20:32:42.5781852Z torch.manual_seed(2025) 2025-05-07T20:32:42.5781922Z 2025-05-07T20:32:42.5782084Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5782157Z 2025-05-07T20:32:42.5782288Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5782408Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5782492Z x = x_sign * x_clamp 2025-05-07T20:32:42.5782567Z x0 = x[:, :D] 2025-05-07T20:32:42.5782641Z x1 = x[:, D:] 2025-05-07T20:32:42.5782712Z 2025-05-07T20:32:42.5782788Z if contiguous: 2025-05-07T20:32:42.5782879Z x0 = x0.contiguous() 2025-05-07T20:32:42.5782964Z x1 = x1.contiguous() 2025-05-07T20:32:42.5783031Z 2025-05-07T20:32:42.5783118Z if scale_ub is not None: 2025-05-07T20:32:42.5783223Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5783354Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5783423Z ) 2025-05-07T20:32:42.5783496Z else: 2025-05-07T20:32:42.5783585Z scale_ub_tensor = None 2025-05-07T20:32:42.5783652Z 2025-05-07T20:32:42.5783781Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5783873Z op = silu_mul_quant 2025-05-07T20:32:42.5783958Z if compiled: 2025-05-07T20:32:42.5784052Z op = torch.compile(op) 2025-05-07T20:32:42.5784152Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5784222Z 2025-05-07T20:32:42.5784310Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.5784429Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.5784499Z 2025-05-07T20:32:42.5784629Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5784726Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.5784826Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.5784948Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.5785080Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5785153Z 2025-05-07T20:32:42.5785249Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:42.5785258Z 2025-05-07T20:32:42.5785355Z moe/activation_test.py:126: 2025-05-07T20:32:42.5785477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5785575Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5785705Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5786258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5786401Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5786794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5787011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5787371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5787661Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5788053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5788304Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5788673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5788837Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5789173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5789255Z fn() 2025-05-07T20:32:42.5789648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5789724Z self.fn.run( 2025-05-07T20:32:42.5790148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5790242Z kernel = self.compile( 2025-05-07T20:32:42.5790615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5790784Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5790910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5790914Z 2025-05-07T20:32:42.5791124Z self = 2025-05-07T20:32:42.5791907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5792416Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9fc92c0dc0>} 2025-05-07T20:32:42.5793159Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5793344Z context = 2025-05-07T20:32:42.5793354Z 2025-05-07T20:32:42.5793515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5793773Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5793881Z module_map=module_map) 2025-05-07T20:32:42.5794039Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5794133Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5794205Z E ^ 2025-05-07T20:32:42.5794559Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5794566Z 2025-05-07T20:32:42.5794975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5794979Z 2025-05-07T20:32:42.5795076Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5795293Z self=, 2025-05-07T20:32:42.5795414Z T=1, 2025-05-07T20:32:42.5795486Z D=5120, 2025-05-07T20:32:42.5795563Z scale_ub=None, 2025-05-07T20:32:42.5795646Z contiguous=True, 2025-05-07T20:32:42.5795725Z compiled=False, 2025-05-07T20:32:42.5795833Z ) 2025-05-07T20:32:42.5796078Z self = 2025-05-07T20:32:42.5796261Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.5796265Z 2025-05-07T20:32:42.5796342Z @given( 2025-05-07T20:32:42.5796459Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5796591Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5796703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5796814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5796921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5796993Z ) 2025-05-07T20:32:42.5797237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5797325Z def test_silu_mul_quant( 2025-05-07T20:32:42.5797399Z self, 2025-05-07T20:32:42.5797469Z T: int, 2025-05-07T20:32:42.5797539Z D: int, 2025-05-07T20:32:42.5797636Z scale_ub: Optional[float], 2025-05-07T20:32:42.5797719Z contiguous: bool, 2025-05-07T20:32:42.5797803Z compiled: bool, 2025-05-07T20:32:42.5797876Z ) -> None: 2025-05-07T20:32:42.5797963Z torch.manual_seed(2025) 2025-05-07T20:32:42.5798034Z 2025-05-07T20:32:42.5798240Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5798311Z 2025-05-07T20:32:42.5798401Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5798519Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5798607Z x = x_sign * x_clamp 2025-05-07T20:32:42.5798686Z x0 = x[:, :D] 2025-05-07T20:32:42.5798764Z x1 = x[:, D:] 2025-05-07T20:32:42.5798833Z 2025-05-07T20:32:42.5798911Z if contiguous: 2025-05-07T20:32:42.5798997Z x0 = x0.contiguous() 2025-05-07T20:32:42.5799082Z x1 = x1.contiguous() 2025-05-07T20:32:42.5799150Z 2025-05-07T20:32:42.5799241Z if scale_ub is not None: 2025-05-07T20:32:42.5799343Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5799471Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5799544Z ) 2025-05-07T20:32:42.5799618Z else: 2025-05-07T20:32:42.5799706Z scale_ub_tensor = None 2025-05-07T20:32:42.5799775Z 2025-05-07T20:32:42.5799904Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5799990Z op = silu_mul_quant 2025-05-07T20:32:42.5800068Z if compiled: 2025-05-07T20:32:42.5800165Z op 
= torch.compile(op) 2025-05-07T20:32:42.5800265Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5800336Z 2025-05-07T20:32:42.5800420Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5800425Z 2025-05-07T20:32:42.5800516Z moe/activation_test.py:117: 2025-05-07T20:32:42.5800639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5800738Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5800832Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5801332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5801427Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5801786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5802003Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5802336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5802477Z kernel = self.compile( 2025-05-07T20:32:42.5802852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5803061Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5803188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5803192Z 2025-05-07T20:32:42.5803393Z self = 2025-05-07T20:32:42.5804459Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5805061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc926edc0>} 2025-05-07T20:32:42.5805809Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5806000Z context = 2025-05-07T20:32:42.5806006Z 2025-05-07T20:32:42.5806168Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5806429Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5806595Z module_map=module_map) 2025-05-07T20:32:42.5806758Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5806851Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5806923Z E ^ 2025-05-07T20:32:42.5807279Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5807286Z 2025-05-07T20:32:42.5807695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5807700Z 2025-05-07T20:32:42.5807798Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5808021Z self=, 2025-05-07T20:32:42.5808091Z T=128, 2025-05-07T20:32:42.5808167Z D=5120, 2025-05-07T20:32:42.5808243Z scale_ub=None, 2025-05-07T20:32:42.5808325Z contiguous=False, 2025-05-07T20:32:42.5808404Z compiled=True, 2025-05-07T20:32:42.5808476Z ) 2025-05-07T20:32:42.5808692Z self = 2025-05-07T20:32:42.5808864Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.5808869Z 2025-05-07T20:32:42.5808944Z @given( 2025-05-07T20:32:42.5809058Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5809156Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5809265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5809382Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5809490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5809562Z ) 2025-05-07T20:32:42.5809804Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5809893Z def test_silu_mul_quant( 2025-05-07T20:32:42.5809964Z self, 2025-05-07T20:32:42.5810037Z T: int, 2025-05-07T20:32:42.5810109Z D: int, 2025-05-07T20:32:42.5810208Z scale_ub: Optional[float], 2025-05-07T20:32:42.5810297Z contiguous: bool, 2025-05-07T20:32:42.5810377Z compiled: bool, 2025-05-07T20:32:42.5810450Z ) -> None: 2025-05-07T20:32:42.5810543Z torch.manual_seed(2025) 2025-05-07T20:32:42.5810611Z 2025-05-07T20:32:42.5810777Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5810914Z 2025-05-07T20:32:42.5811000Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5811121Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5811204Z x = x_sign * x_clamp 2025-05-07T20:32:42.5811360Z x0 = x[:, :D] 2025-05-07T20:32:42.5811438Z x1 = x[:, D:] 2025-05-07T20:32:42.5811504Z 2025-05-07T20:32:42.5811581Z if contiguous: 2025-05-07T20:32:42.5811671Z x0 = x0.contiguous() 2025-05-07T20:32:42.5811753Z x1 = x1.contiguous() 2025-05-07T20:32:42.5811817Z 2025-05-07T20:32:42.5811950Z if scale_ub is not None: 2025-05-07T20:32:42.5812049Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5812183Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5812256Z ) 2025-05-07T20:32:42.5812329Z else: 2025-05-07T20:32:42.5812419Z scale_ub_tensor = None 2025-05-07T20:32:42.5812487Z 2025-05-07T20:32:42.5812613Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5812701Z op = silu_mul_quant 2025-05-07T20:32:42.5812780Z if compiled: 2025-05-07T20:32:42.5812872Z op = torch.compile(op) 2025-05-07T20:32:42.5812981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5813048Z 2025-05-07T20:32:42.5813134Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5813138Z 2025-05-07T20:32:42.5813235Z moe/activation_test.py:117: 2025-05-07T20:32:42.5813357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5813501Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5813599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5813962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5814055Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.5814545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5814640Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5814998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5815218Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5815553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5815643Z kernel = self.compile( 2025-05-07T20:32:42.5816021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5816191Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5816313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5816318Z 2025-05-07T20:32:42.5816524Z self = 2025-05-07T20:32:42.5817301Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5817807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8ee8040>} 2025-05-07T20:32:42.5818549Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5818737Z context = 2025-05-07T20:32:42.5818742Z 2025-05-07T20:32:42.5818905Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5819205Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5819308Z module_map=module_map) 2025-05-07T20:32:42.5819467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5819597Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5819677Z E ^ 2025-05-07T20:32:42.5820025Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5820030Z 2025-05-07T20:32:42.5820441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5820484Z 2025-05-07T20:32:42.5820583Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5820798Z self=, 2025-05-07T20:32:42.5820870Z T=128, 2025-05-07T20:32:42.5820946Z D=7168, 2025-05-07T20:32:42.5821027Z scale_ub=1200.0, 2025-05-07T20:32:42.5821109Z contiguous=False, 2025-05-07T20:32:42.5821187Z compiled=False, 2025-05-07T20:32:42.5821254Z ) 2025-05-07T20:32:42.5821468Z self = 2025-05-07T20:32:42.5821640Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.5821645Z 2025-05-07T20:32:42.5821714Z @given( 2025-05-07T20:32:42.5821831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5821924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5822077Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5822195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5822304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5822376Z ) 2025-05-07T20:32:42.5822616Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5822704Z def test_silu_mul_quant( 2025-05-07T20:32:42.5822783Z self, 2025-05-07T20:32:42.5822854Z T: int, 2025-05-07T20:32:42.5822923Z D: int, 2025-05-07T20:32:42.5823021Z scale_ub: Optional[float], 2025-05-07T20:32:42.5823103Z contiguous: bool, 2025-05-07T20:32:42.5823185Z compiled: bool, 2025-05-07T20:32:42.5823260Z ) -> None: 2025-05-07T20:32:42.5823350Z torch.manual_seed(2025) 2025-05-07T20:32:42.5823418Z 2025-05-07T20:32:42.5823583Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5823650Z 2025-05-07T20:32:42.5823745Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5823863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5823946Z x = x_sign * x_clamp 2025-05-07T20:32:42.5824021Z x0 = x[:, :D] 2025-05-07T20:32:42.5824095Z x1 = x[:, D:] 2025-05-07T20:32:42.5824162Z 2025-05-07T20:32:42.5824243Z if contiguous: 2025-05-07T20:32:42.5824328Z x0 = x0.contiguous() 2025-05-07T20:32:42.5824414Z x1 = x1.contiguous() 2025-05-07T20:32:42.5824486Z 2025-05-07T20:32:42.5824571Z if scale_ub is not None: 2025-05-07T20:32:42.5824670Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5824803Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5824876Z ) 2025-05-07T20:32:42.5824952Z else: 2025-05-07T20:32:42.5825043Z scale_ub_tensor = None 2025-05-07T20:32:42.5825112Z 2025-05-07T20:32:42.5825239Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5825327Z op = silu_mul_quant 2025-05-07T20:32:42.5825409Z if compiled: 2025-05-07T20:32:42.5825506Z op = torch.compile(op) 2025-05-07T20:32:42.5825607Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5825671Z 2025-05-07T20:32:42.5825759Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5825763Z 2025-05-07T20:32:42.5825905Z moe/activation_test.py:117: 2025-05-07T20:32:42.5826027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5826132Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5826228Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5826773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5826867Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5827218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5827485Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5827817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5827907Z kernel = self.compile( 2025-05-07T20:32:42.5828282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5828454Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5828579Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5828586Z 2025-05-07T20:32:42.5828788Z self = 2025-05-07T20:32:42.5829599Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5830191Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8ee8ca0>} 2025-05-07T20:32:42.5830927Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5831118Z context = 2025-05-07T20:32:42.5831122Z 2025-05-07T20:32:42.5831285Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5831544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5831646Z module_map=module_map) 2025-05-07T20:32:42.5831803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5831906Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5831979Z E ^ 2025-05-07T20:32:42.5832337Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5832342Z 2025-05-07T20:32:42.5832750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5832757Z 2025-05-07T20:32:42.5832853Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5833076Z self=, 2025-05-07T20:32:42.5833147Z T=128, 2025-05-07T20:32:42.5833219Z D=5120, 2025-05-07T20:32:42.5833299Z scale_ub=None, 2025-05-07T20:32:42.5833379Z contiguous=False, 2025-05-07T20:32:42.5833460Z compiled=False, 2025-05-07T20:32:42.5833533Z ) 2025-05-07T20:32:42.5833751Z self = 2025-05-07T20:32:42.5833922Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.5833927Z 2025-05-07T20:32:42.5834001Z @given( 2025-05-07T20:32:42.5834115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5834213Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5834323Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5834483Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5834596Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5834666Z ) 2025-05-07T20:32:42.5834907Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5835040Z def test_silu_mul_quant( 2025-05-07T20:32:42.5835114Z self, 2025-05-07T20:32:42.5835190Z T: int, 2025-05-07T20:32:42.5835261Z D: int, 2025-05-07T20:32:42.5835355Z scale_ub: Optional[float], 2025-05-07T20:32:42.5835445Z contiguous: bool, 2025-05-07T20:32:42.5835566Z compiled: bool, 2025-05-07T20:32:42.5835638Z ) -> None: 2025-05-07T20:32:42.5835731Z torch.manual_seed(2025) 2025-05-07T20:32:42.5835799Z 2025-05-07T20:32:42.5835961Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5836031Z 2025-05-07T20:32:42.5836117Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5836247Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5836351Z x = x_sign * x_clamp 2025-05-07T20:32:42.5836435Z x0 = x[:, :D] 2025-05-07T20:32:42.5836523Z x1 = x[:, D:] 2025-05-07T20:32:42.5836593Z 2025-05-07T20:32:42.5836670Z if contiguous: 2025-05-07T20:32:42.5836762Z x0 = x0.contiguous() 2025-05-07T20:32:42.5836846Z x1 = x1.contiguous() 2025-05-07T20:32:42.5836911Z 2025-05-07T20:32:42.5837000Z if scale_ub is not None: 2025-05-07T20:32:42.5837101Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5837278Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5837355Z ) 2025-05-07T20:32:42.5837427Z else: 2025-05-07T20:32:42.5837516Z scale_ub_tensor = None 2025-05-07T20:32:42.5837583Z 2025-05-07T20:32:42.5837706Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5837791Z op = silu_mul_quant 2025-05-07T20:32:42.5837877Z if compiled: 2025-05-07T20:32:42.5837971Z op = torch.compile(op) 2025-05-07T20:32:42.5838080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5838148Z 2025-05-07T20:32:42.5838232Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5838240Z 2025-05-07T20:32:42.5838335Z moe/activation_test.py:117: 2025-05-07T20:32:42.5838461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5838555Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5838653Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5839154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5839247Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5839604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5839821Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5840158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5840247Z kernel = self.compile( 2025-05-07T20:32:42.5840623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5840796Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5840917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5840924Z 2025-05-07T20:32:42.5841130Z self = 2025-05-07T20:32:42.5841901Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5842476Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc89fd310>} 2025-05-07T20:32:42.5843252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5843439Z context = 2025-05-07T20:32:42.5843444Z 2025-05-07T20:32:42.5843610Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5843909Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5844010Z module_map=module_map) 2025-05-07T20:32:42.5844170Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5844263Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5844340Z E ^ 2025-05-07T20:32:42.5844697Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:42.5845216Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

Same test body as above; the call into fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant -> _fbgemm_silu_mul_quant[grid]) fails in Triton's compile pipeline (jit.py:623 run -> compiler.py:273 compile -> make_ir -> ast_to_ttir) with the identical error:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
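The repeated ValueError is Triton's architecture check: fp8e4nv names the e4m3 FP8 encoding, which appears to require an SM 8.9+ GPU (Ada/Hopper), while the A10G behind linux.g5.4xlarge is SM 8.6 and only offers the fp8e4b15 and fp8e5 encodings. A minimal sketch of a capability guard that would skip these tests on such runners; supports_fp8e4nv is a hypothetical helper, not part of the test suite:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv is Triton's name for the e4m3 FP8 format; hardware
        # support is assumed here to begin at compute capability 8.9
        # (Ada) -- the A10G (8.6) on this runner predates it.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    # def test_silu_mul_quant(self, ...) -> None: ...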
2025-05-07T20:32:42.5863264Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

Same test body; because compiled=True, the traceback additionally passes through torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching activation.py:80, then fails with the identical CompilationError:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
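Hypothesis's verbose output above prints every drawn example, so any one of them can be replayed deterministically outside CI by pinning it with hypothesis.example before the randomized search. A minimal self-contained sketch; test_shapes is a stand-in for the real test, not part of the suite:

    from hypothesis import example, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=1, D=7168)  # replay a failing draw from the log first
    @settings(max_examples=10, deadline=None)
    def test_shapes(T: int, D: int) -> None:
        # A real reproduction would call silu_mul_quant here; this
        # stand-in only demonstrates the pinning mechanism.
        assert T > 0 and D > 0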
2025-05-07T20:32:42.5876197Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Same failure, same traceback through eval_frame.py:678 and activation.py:80.

2025-05-07T20:32:42.5889011Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

This example gets further: the compiled silu_mul_quant call itself returns (y_fp8, y_scale = fn() succeeds), and the failure moves to the reference path:

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](

The launch goes through the Triton autotuner (autotuner.py:186 -> autotuner.py:166 _bench -> testing.py:117 do_bench -> autotuner.py:152 kernel_call -> jit.py:623 compile -> compiler.py:273 make_ir) and hits the same architecture check while compiling the quantization kernel:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
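For comparison, the quantity ref_fn checks can be sketched in plain PyTorch with no Triton kernel, which also avoids the fp8e4nv compile step entirely. A minimal sketch, assuming a PyTorch build with torch.float8_e4m3fn; the scale convention (per-row max over the FP8 maximum, optionally capped by scale_ub) is inferred from the test's dequantization step y_fp8.to(torch.float32) * y_scale[:, None], not taken from FBGEMM's kernel:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        x0 = x0.to(torch.float32)
        x1 = x1.to(torch.float32)
        y = x0 * torch.sigmoid(x0) * x1          # SiLU(x0) * x1
        row_max = y.abs().amax(dim=1)            # per-row dynamic range
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max))
        scale = (row_max / FP8_MAX).clamp(min=1e-12)
        y_fp8 = (y / scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale

Under this convention the roundtrip y_fp8.to(torch.float32) * scale[:, None] approximates y to within e4m3 precision, which is what the test's comparison relies on.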
at 0x7f9fc8c44160>} 2025-05-07T20:32:42.5903074Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5903261Z context = 2025-05-07T20:32:42.5903307Z 2025-05-07T20:32:42.5903468Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5903940Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5904201Z module_map=module_map) 2025-05-07T20:32:42.5904366Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5904466Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5904540Z E ^ 2025-05-07T20:32:42.5904891Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5904959Z 2025-05-07T20:32:42.5905375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5905379Z 2025-05-07T20:32:42.5905477Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5905698Z self=, 2025-05-07T20:32:42.5905776Z T=1, 2025-05-07T20:32:42.5905847Z D=5120, 2025-05-07T20:32:42.5905927Z scale_ub=1200.0, 2025-05-07T20:32:42.5906009Z contiguous=False, 2025-05-07T20:32:42.5906090Z compiled=True, 2025-05-07T20:32:42.5906162Z ) 2025-05-07T20:32:42.5906377Z self = 2025-05-07T20:32:42.5906538Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.5906543Z 2025-05-07T20:32:42.5906616Z @given( 2025-05-07T20:32:42.5906789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5906891Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5907000Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5907111Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5907223Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5907293Z ) 2025-05-07T20:32:42.5907536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5907630Z def test_silu_mul_quant( 2025-05-07T20:32:42.5907702Z self, 2025-05-07T20:32:42.5907775Z T: int, 2025-05-07T20:32:42.5907850Z D: int, 2025-05-07T20:32:42.5907947Z scale_ub: Optional[float], 2025-05-07T20:32:42.5908030Z contiguous: bool, 2025-05-07T20:32:42.5908113Z compiled: bool, 2025-05-07T20:32:42.5908188Z ) -> None: 2025-05-07T20:32:42.5908278Z torch.manual_seed(2025) 2025-05-07T20:32:42.5908343Z 2025-05-07T20:32:42.5908510Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5908580Z 2025-05-07T20:32:42.5908671Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5908791Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5908878Z x = x_sign * x_clamp 2025-05-07T20:32:42.5908951Z x0 = x[:, :D] 2025-05-07T20:32:42.5909024Z x1 = x[:, D:] 2025-05-07T20:32:42.5909096Z 2025-05-07T20:32:42.5909172Z if contiguous: 2025-05-07T20:32:42.5909257Z x0 = x0.contiguous() 2025-05-07T20:32:42.5909348Z x1 = x1.contiguous() 2025-05-07T20:32:42.5909415Z 2025-05-07T20:32:42.5909508Z if scale_ub is not None: 2025-05-07T20:32:42.5909606Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5909735Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5909877Z ) 2025-05-07T20:32:42.5909950Z else: 2025-05-07T20:32:42.5910039Z scale_ub_tensor = None 2025-05-07T20:32:42.5910115Z 2025-05-07T20:32:42.5910240Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5910324Z op = silu_mul_quant 2025-05-07T20:32:42.5910408Z if compiled: 
2025-05-07T20:32:42.5910500Z op = torch.compile(op) 2025-05-07T20:32:42.5910600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5910738Z 2025-05-07T20:32:42.5910824Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5910828Z 2025-05-07T20:32:42.5910923Z moe/activation_test.py:117: 2025-05-07T20:32:42.5911046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5911180Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5911279Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5911641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5911739Z return fn(*args, **kwargs) 2025-05-07T20:32:42.5912269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5912367Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5912719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5912938Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5913275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5913364Z kernel = self.compile( 2025-05-07T20:32:42.5913738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5913912Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5914034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5914041Z 2025-05-07T20:32:42.5914284Z self = 2025-05-07T20:32:42.5915059Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5915569Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8c44b80>} 2025-05-07T20:32:42.5916314Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5916499Z context = 2025-05-07T20:32:42.5916504Z 2025-05-07T20:32:42.5916668Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5916927Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5917029Z module_map=module_map) 2025-05-07T20:32:42.5917189Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5917284Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5917357Z E ^ 2025-05-07T20:32:42.5917714Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5917718Z 2025-05-07T20:32:42.5918128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5918133Z 2025-05-07T20:32:42.5918235Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5918452Z self=, 2025-05-07T20:32:42.5918524Z T=1, 2025-05-07T20:32:42.5918598Z D=5120, 2025-05-07T20:32:42.5918677Z scale_ub=1200.0, 2025-05-07T20:32:42.5918761Z contiguous=False, 2025-05-07T20:32:42.5918839Z compiled=False, 2025-05-07T20:32:42.5918905Z ) 2025-05-07T20:32:42.5919122Z self = 2025-05-07T20:32:42.5919284Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.5919330Z 2025-05-07T20:32:42.5919405Z @given( 2025-05-07T20:32:42.5919520Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5919612Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5919768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5919882Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5919991Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5920064Z ) 2025-05-07T20:32:42.5920308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5920441Z def test_silu_mul_quant( 2025-05-07T20:32:42.5920511Z self, 2025-05-07T20:32:42.5920582Z T: int, 2025-05-07T20:32:42.5920650Z D: int, 2025-05-07T20:32:42.5920748Z scale_ub: Optional[float], 2025-05-07T20:32:42.5920830Z contiguous: bool, 2025-05-07T20:32:42.5920911Z compiled: bool, 2025-05-07T20:32:42.5920986Z ) -> None: 2025-05-07T20:32:42.5921074Z torch.manual_seed(2025) 2025-05-07T20:32:42.5921145Z 2025-05-07T20:32:42.5921307Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5921374Z 2025-05-07T20:32:42.5921469Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5921586Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5921668Z x = x_sign * x_clamp 2025-05-07T20:32:42.5921746Z x0 = x[:, :D] 2025-05-07T20:32:42.5921820Z x1 = x[:, D:] 2025-05-07T20:32:42.5921886Z 2025-05-07T20:32:42.5921965Z if contiguous: 2025-05-07T20:32:42.5922095Z x0 = x0.contiguous() 2025-05-07T20:32:42.5922182Z x1 = x1.contiguous() 2025-05-07T20:32:42.5922250Z 2025-05-07T20:32:42.5922334Z if scale_ub is not None: 2025-05-07T20:32:42.5922436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5922569Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5922643Z ) 2025-05-07T20:32:42.5922719Z else: 2025-05-07T20:32:42.5922809Z scale_ub_tensor = None 2025-05-07T20:32:42.5922877Z 2025-05-07T20:32:42.5923004Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5923091Z op = silu_mul_quant 2025-05-07T20:32:42.5923169Z if compiled: 2025-05-07T20:32:42.5923267Z op = torch.compile(op) 2025-05-07T20:32:42.5923366Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5923430Z 2025-05-07T20:32:42.5923519Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5923529Z 2025-05-07T20:32:42.5923621Z moe/activation_test.py:117: 2025-05-07T20:32:42.5923747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5923843Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5923936Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5924435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5924529Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5924879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5925104Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5925435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5925531Z kernel = self.compile( 2025-05-07T20:32:42.5925911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5926080Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5926202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5926207Z 2025-05-07T20:32:42.5926407Z self = 2025-05-07T20:32:42.5927282Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5927783Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc9015550>} 2025-05-07T20:32:42.5928522Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5928776Z context = 2025-05-07T20:32:42.5928780Z 2025-05-07T20:32:42.5928941Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5929202Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5929303Z module_map=module_map) 2025-05-07T20:32:42.5929459Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5929560Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5929633Z E ^ 2025-05-07T20:32:42.5929981Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5929990Z 2025-05-07T20:32:42.5930438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5930446Z 2025-05-07T20:32:42.5930544Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5930764Z self=, 2025-05-07T20:32:42.5930839Z T=16384, 2025-05-07T20:32:42.5930911Z D=5120, 2025-05-07T20:32:42.5930997Z scale_ub=1200.0, 2025-05-07T20:32:42.5931075Z contiguous=False, 2025-05-07T20:32:42.5931151Z compiled=True, 2025-05-07T20:32:42.5931225Z ) 2025-05-07T20:32:42.5931439Z self = 2025-05-07T20:32:42.5931616Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.5931621Z 2025-05-07T20:32:42.5931692Z @given( 2025-05-07T20:32:42.5931805Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5931899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5932011Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5932128Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5932239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5932309Z ) 2025-05-07T20:32:42.5932552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5932638Z def test_silu_mul_quant( 2025-05-07T20:32:42.5932711Z self, 2025-05-07T20:32:42.5932785Z T: int, 2025-05-07T20:32:42.5932856Z D: int, 2025-05-07T20:32:42.5932950Z scale_ub: Optional[float], 2025-05-07T20:32:42.5933038Z contiguous: bool, 2025-05-07T20:32:42.5933121Z compiled: bool, 2025-05-07T20:32:42.5933193Z ) -> None: 2025-05-07T20:32:42.5933286Z torch.manual_seed(2025) 2025-05-07T20:32:42.5933354Z 2025-05-07T20:32:42.5933516Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5933589Z 2025-05-07T20:32:42.5933679Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5933800Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5933886Z x = x_sign * x_clamp 2025-05-07T20:32:42.5933959Z x0 = x[:, :D] 2025-05-07T20:32:42.5934034Z x1 = x[:, D:] 2025-05-07T20:32:42.5934099Z 2025-05-07T20:32:42.5934175Z if contiguous: 2025-05-07T20:32:42.5934264Z x0 = x0.contiguous() 2025-05-07T20:32:42.5934396Z x1 = x1.contiguous() 2025-05-07T20:32:42.5934461Z 2025-05-07T20:32:42.5934553Z if scale_ub is not None: 2025-05-07T20:32:42.5934653Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5934820Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5934896Z ) 2025-05-07T20:32:42.5934966Z else: 2025-05-07T20:32:42.5935055Z scale_ub_tensor = None 2025-05-07T20:32:42.5935126Z 2025-05-07T20:32:42.5935249Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5935379Z op = silu_mul_quant 2025-05-07T20:32:42.5935458Z if compiled: 2025-05-07T20:32:42.5935553Z op = torch.compile(op) 2025-05-07T20:32:42.5935655Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5935720Z 2025-05-07T20:32:42.5935805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5935809Z 2025-05-07T20:32:42.5935908Z moe/activation_test.py:117: 2025-05-07T20:32:42.5936031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5936125Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5936223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5936586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5936678Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.5937163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5937849Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5938212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5938429Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5938760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5938855Z kernel = self.compile( 2025-05-07T20:32:42.5939227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5939401Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5939522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5939527Z 2025-05-07T20:32:42.5939728Z self = 2025-05-07T20:32:42.5940510Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5941014Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8cee1f0>} 2025-05-07T20:32:42.5941761Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5941947Z context = 2025-05-07T20:32:42.5941951Z 2025-05-07T20:32:42.5942113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5942375Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5942481Z module_map=module_map) 2025-05-07T20:32:42.5942645Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5942737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5942806Z E ^ 2025-05-07T20:32:42.5943157Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5943207Z 2025-05-07T20:32:42.5943616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5943621Z 2025-05-07T20:32:42.5943761Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5943979Z self=, 2025-05-07T20:32:42.5944049Z T=2048, 2025-05-07T20:32:42.5944122Z D=7168, 2025-05-07T20:32:42.5944199Z scale_ub=1200.0, 2025-05-07T20:32:42.5944279Z contiguous=False, 2025-05-07T20:32:42.5944401Z compiled=True, 2025-05-07T20:32:42.5944468Z ) 2025-05-07T20:32:42.5944679Z self = 2025-05-07T20:32:42.5944849Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.5944853Z 2025-05-07T20:32:42.5944926Z @given( 2025-05-07T20:32:42.5945042Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5945137Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5945246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5945360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5945471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5945539Z ) 2025-05-07T20:32:42.5945781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5945871Z def test_silu_mul_quant( 2025-05-07T20:32:42.5945944Z self, 2025-05-07T20:32:42.5946017Z T: int, 2025-05-07T20:32:42.5946125Z D: int, 2025-05-07T20:32:42.5946222Z scale_ub: Optional[float], 2025-05-07T20:32:42.5946305Z contiguous: bool, 2025-05-07T20:32:42.5946385Z compiled: bool, 2025-05-07T20:32:42.5946461Z ) -> None: 2025-05-07T20:32:42.5946549Z torch.manual_seed(2025) 2025-05-07T20:32:42.5946617Z 2025-05-07T20:32:42.5946786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5946853Z 2025-05-07T20:32:42.5946936Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5947059Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5947143Z x = x_sign * x_clamp 2025-05-07T20:32:42.5947218Z x0 = x[:, :D] 2025-05-07T20:32:42.5947300Z x1 = x[:, D:] 2025-05-07T20:32:42.5947367Z 2025-05-07T20:32:42.5947451Z if contiguous: 2025-05-07T20:32:42.5947537Z x0 = x0.contiguous() 2025-05-07T20:32:42.5947623Z x1 = x1.contiguous() 2025-05-07T20:32:42.5947700Z 2025-05-07T20:32:42.5947786Z if scale_ub is not None: 2025-05-07T20:32:42.5947886Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5948020Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5948094Z ) 2025-05-07T20:32:42.5948168Z else: 2025-05-07T20:32:42.5948260Z scale_ub_tensor = None 2025-05-07T20:32:42.5948326Z 2025-05-07T20:32:42.5948451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5948537Z op = silu_mul_quant 2025-05-07T20:32:42.5948616Z if compiled: 2025-05-07T20:32:42.5948712Z op = torch.compile(op) 2025-05-07T20:32:42.5948812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5948880Z 2025-05-07T20:32:42.5948969Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5948973Z 2025-05-07T20:32:42.5949065Z moe/activation_test.py:117: 2025-05-07T20:32:42.5949193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5949295Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5949389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5949748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5949889Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.5950422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5950516Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5950906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5951127Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5951461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5951590Z kernel = self.compile( 2025-05-07T20:32:42.5951965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5952133Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5952253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5952260Z 2025-05-07T20:32:42.5952460Z self = 2025-05-07T20:32:42.5953236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5953752Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8ceeee0>} 2025-05-07T20:32:42.5954536Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5954725Z context = 2025-05-07T20:32:42.5954729Z 2025-05-07T20:32:42.5954892Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5955153Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5955257Z module_map=module_map) 2025-05-07T20:32:42.5955416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5955507Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5955584Z E ^ 2025-05-07T20:32:42.5955933Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5955941Z 2025-05-07T20:32:42.5956349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5956357Z 2025-05-07T20:32:42.5956455Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5956672Z self=, 2025-05-07T20:32:42.5956746Z T=1, 2025-05-07T20:32:42.5956817Z D=5120, 2025-05-07T20:32:42.5956894Z scale_ub=None, 2025-05-07T20:32:42.5956978Z contiguous=False, 2025-05-07T20:32:42.5957058Z compiled=False, 2025-05-07T20:32:42.5957127Z ) 2025-05-07T20:32:42.5957347Z self = 2025-05-07T20:32:42.5957508Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.5957512Z 2025-05-07T20:32:42.5957586Z @given( 2025-05-07T20:32:42.5957697Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5957793Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5957913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5958025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5958133Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5958205Z ) 2025-05-07T20:32:42.5958446Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5958580Z def test_silu_mul_quant( 2025-05-07T20:32:42.5958657Z self, 2025-05-07T20:32:42.5958730Z T: int, 2025-05-07T20:32:42.5958800Z D: int, 2025-05-07T20:32:42.5958900Z scale_ub: Optional[float], 2025-05-07T20:32:42.5959044Z contiguous: bool, 2025-05-07T20:32:42.5959130Z compiled: bool, 2025-05-07T20:32:42.5959201Z ) -> None: 2025-05-07T20:32:42.5959291Z torch.manual_seed(2025) 2025-05-07T20:32:42.5959358Z 2025-05-07T20:32:42.5959524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5959632Z 2025-05-07T20:32:42.5959722Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5959840Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5959925Z x = x_sign * x_clamp 2025-05-07T20:32:42.5960004Z x0 = x[:, :D] 2025-05-07T20:32:42.5960078Z x1 = x[:, D:] 2025-05-07T20:32:42.5960145Z 2025-05-07T20:32:42.5960227Z if contiguous: 2025-05-07T20:32:42.5960317Z x0 = x0.contiguous() 2025-05-07T20:32:42.5960406Z x1 = x1.contiguous() 2025-05-07T20:32:42.5960472Z 2025-05-07T20:32:42.5960559Z if scale_ub is not None: 2025-05-07T20:32:42.5960664Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5960791Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5960861Z ) 2025-05-07T20:32:42.5960940Z else: 2025-05-07T20:32:42.5961028Z scale_ub_tensor = None 2025-05-07T20:32:42.5961096Z 2025-05-07T20:32:42.5961266Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5961355Z op = silu_mul_quant 2025-05-07T20:32:42.5961435Z if compiled: 2025-05-07T20:32:42.5961533Z op = torch.compile(op) 2025-05-07T20:32:42.5961634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5961703Z 2025-05-07T20:32:42.5961789Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5961797Z 2025-05-07T20:32:42.5961888Z moe/activation_test.py:117: 2025-05-07T20:32:42.5962013Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5962112Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5962205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5962702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5962793Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5963149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5963371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5963707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5963797Z kernel = self.compile( 2025-05-07T20:32:42.5964174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5964343Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5964469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5964473Z 2025-05-07T20:32:42.5964674Z self = 2025-05-07T20:32:42.5965450Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5965990Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8d595e0>} 2025-05-07T20:32:42.5966747Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5966984Z context = 2025-05-07T20:32:42.5966988Z 2025-05-07T20:32:42.5967186Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5967448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5967550Z module_map=module_map) 2025-05-07T20:32:42.5967708Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5967845Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5967918Z E ^ 2025-05-07T20:32:42.5968280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5968284Z 2025-05-07T20:32:42.5968692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5968698Z 2025-05-07T20:32:42.5968796Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5969016Z self=, 2025-05-07T20:32:42.5969090Z T=4096, 2025-05-07T20:32:42.5969169Z D=7168, 2025-05-07T20:32:42.5969247Z scale_ub=1200.0, 2025-05-07T20:32:42.5969329Z contiguous=False, 2025-05-07T20:32:42.5969411Z compiled=False, 2025-05-07T20:32:42.5969478Z ) 2025-05-07T20:32:42.5969734Z self = 2025-05-07T20:32:42.5969910Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.5969915Z 2025-05-07T20:32:42.5969987Z @given( 2025-05-07T20:32:42.5970102Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5970200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5970309Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5970428Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5970539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5970609Z ) 2025-05-07T20:32:42.5970854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5970942Z def test_silu_mul_quant( 2025-05-07T20:32:42.5971013Z self, 2025-05-07T20:32:42.5971084Z T: int, 2025-05-07T20:32:42.5971156Z D: int, 2025-05-07T20:32:42.5971248Z scale_ub: Optional[float], 2025-05-07T20:32:42.5971340Z contiguous: bool, 2025-05-07T20:32:42.5971418Z compiled: bool, 2025-05-07T20:32:42.5971491Z ) -> None: 2025-05-07T20:32:42.5971584Z torch.manual_seed(2025) 2025-05-07T20:32:42.5971653Z 2025-05-07T20:32:42.5971815Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5971885Z 2025-05-07T20:32:42.5971971Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5972096Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5972182Z x = x_sign * x_clamp 2025-05-07T20:32:42.5972259Z x0 = x[:, :D] 2025-05-07T20:32:42.5972340Z x1 = x[:, D:] 2025-05-07T20:32:42.5972405Z 2025-05-07T20:32:42.5972483Z if contiguous: 2025-05-07T20:32:42.5972573Z x0 = x0.contiguous() 2025-05-07T20:32:42.5972657Z x1 = x1.contiguous() 2025-05-07T20:32:42.5973083Z 2025-05-07T20:32:42.5973177Z if scale_ub is not None: 2025-05-07T20:32:42.5973276Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5973415Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5973487Z ) 2025-05-07T20:32:42.5973559Z else: 2025-05-07T20:32:42.5973649Z scale_ub_tensor = None 2025-05-07T20:32:42.5973713Z 2025-05-07T20:32:42.5973836Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5973974Z op = silu_mul_quant 2025-05-07T20:32:42.5974053Z if compiled: 2025-05-07T20:32:42.5974145Z op = torch.compile(op) 2025-05-07T20:32:42.5979845Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5979926Z 2025-05-07T20:32:42.5980097Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5980103Z 2025-05-07T20:32:42.5980203Z moe/activation_test.py:117: 2025-05-07T20:32:42.5980333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5980431Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5980577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5981079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5981178Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5981531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5981753Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5982093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5982187Z kernel = self.compile( 2025-05-07T20:32:42.5982565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5982740Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5982912Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5982920Z 2025-05-07T20:32:42.5983128Z self = 2025-05-07T20:32:42.5983905Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5984412Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc86f51f0>} 2025-05-07T20:32:42.5985156Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5985341Z context = 2025-05-07T20:32:42.5985348Z 2025-05-07T20:32:42.5985512Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5985770Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5985876Z module_map=module_map) 2025-05-07T20:32:42.5986034Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5986130Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5986206Z E ^ 2025-05-07T20:32:42.5986555Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis keeps drawing further examples, and every one fails with the identical CompilationError. The test body and traceback match the failure shown above (for compiled=True the traceback additionally passes through torch/_dynamo/eval_frame.py:678 in _fn), so only the drawn parameters are listed here:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
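For debugging outside Hypothesis, the failing call reduces to a single example. The module path and the op(x0, x1, scale_ub_tensor) call shape are taken from the traceback above; the exact import spelling is an assumption, and the sizes are one of the drawn (T, D) combinations:

```python
import torch

# Import path as shown in the traceback; the public import may differ.
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 128, 5120  # one of the sampled (T, D) combinations from the test
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D], x[:, D:]

# On a GPU without e4m3 support this raises the same CompilationError:
# ValueError("type fp8e4nv not supported in this architecture. ...")
y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # scale_ub_tensor=None case
```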
2025-05-07T20:32:42.6131965Z Hypothesis then tried the following examples; every one failed with a traceback identical to the first (for compiled=True the call additionally passes through /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn before reaching activation.py:80) and ended in the same CompilationError. The last example of this span is shown in full below.

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
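For scale, the @given strategies above sample from a fixed grid, which is why so many distinct examples hit the same compile error. A quick sketch (plain Python, values copied from the decorator; not part of the test suite):

    from itertools import product

    T_vals = [1, 128, 2048, 4096, 16384]
    D_vals = [5120, 7168]
    scale_ub_vals = [None, 1200.00]
    bools = [True, False]

    # Every (T, D, scale_ub, contiguous, compiled) combination Hypothesis may draw:
    grid = list(product(T_vals, D_vals, scale_ub_vals, bools, bools))
    print(len(grid))  # 80 combinations; _MAX_SAMPLES (value not shown in this log) caps the attempts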
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6264334Z 2025-05-07T20:32:42.6264746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6264751Z 2025-05-07T20:32:42.6264849Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6265073Z self=, 2025-05-07T20:32:42.6265147Z T=128, 2025-05-07T20:32:42.6265215Z D=7168, 2025-05-07T20:32:42.6265296Z scale_ub=1200.0, 2025-05-07T20:32:42.6265378Z contiguous=False, 2025-05-07T20:32:42.6265456Z compiled=True, 2025-05-07T20:32:42.6265528Z ) 2025-05-07T20:32:42.6265745Z self = 2025-05-07T20:32:42.6265934Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.6265940Z 2025-05-07T20:32:42.6266015Z @given( 2025-05-07T20:32:42.6266152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6266248Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6266402Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6266512Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6266623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6266690Z ) 2025-05-07T20:32:42.6266969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6267064Z def test_silu_mul_quant( 2025-05-07T20:32:42.6267133Z self, 2025-05-07T20:32:42.6267207Z T: int, 2025-05-07T20:32:42.6267280Z D: int, 2025-05-07T20:32:42.6267375Z scale_ub: Optional[float], 2025-05-07T20:32:42.6267500Z contiguous: bool, 2025-05-07T20:32:42.6267581Z compiled: bool, 2025-05-07T20:32:42.6267654Z ) -> None: 2025-05-07T20:32:42.6267749Z torch.manual_seed(2025) 2025-05-07T20:32:42.6267816Z 2025-05-07T20:32:42.6267977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6268054Z 2025-05-07T20:32:42.6268139Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6268258Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6268344Z x = x_sign * x_clamp 2025-05-07T20:32:42.6268417Z x0 = x[:, :D] 2025-05-07T20:32:42.6268496Z x1 = x[:, D:] 2025-05-07T20:32:42.6268565Z 2025-05-07T20:32:42.6268643Z if contiguous: 2025-05-07T20:32:42.6268732Z x0 = x0.contiguous() 2025-05-07T20:32:42.6268816Z x1 = x1.contiguous() 2025-05-07T20:32:42.6268885Z 2025-05-07T20:32:42.6268972Z if scale_ub is not None: 2025-05-07T20:32:42.6269117Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6269250Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6269324Z ) 2025-05-07T20:32:42.6269398Z else: 2025-05-07T20:32:42.6269489Z scale_ub_tensor = None 2025-05-07T20:32:42.6269556Z 2025-05-07T20:32:42.6269681Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6269767Z op = silu_mul_quant 2025-05-07T20:32:42.6269900Z if compiled: 2025-05-07T20:32:42.6269996Z op = torch.compile(op) 2025-05-07T20:32:42.6270099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6270173Z 2025-05-07T20:32:42.6270260Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6270265Z 2025-05-07T20:32:42.6270359Z moe/activation_test.py:117: 2025-05-07T20:32:42.6270481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6270575Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6270677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6271038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6271124Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6271612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6271706Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6272060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6272280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6272613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6272702Z kernel = self.compile( 2025-05-07T20:32:42.6273081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6273257Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6273377Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6273382Z 2025-05-07T20:32:42.6273583Z self = 2025-05-07T20:32:42.6274408Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6274943Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc806a940>} 2025-05-07T20:32:42.6275686Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6275914Z context = 2025-05-07T20:32:42.6275919Z 2025-05-07T20:32:42.6276078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6276339Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6276447Z module_map=module_map) 2025-05-07T20:32:42.6276607Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6276702Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6276774Z E ^ 2025-05-07T20:32:42.6277123Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6277128Z 2025-05-07T20:32:42.6277575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6277584Z 2025-05-07T20:32:42.6277684Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6277900Z self=, 2025-05-07T20:32:42.6277970Z T=2048, 2025-05-07T20:32:42.6278044Z D=7168, 2025-05-07T20:32:42.6278121Z scale_ub=None, 2025-05-07T20:32:42.6278204Z contiguous=True, 2025-05-07T20:32:42.6278286Z compiled=True, 2025-05-07T20:32:42.6278355Z ) 2025-05-07T20:32:42.6278566Z self = 2025-05-07T20:32:42.6278744Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.6278749Z 2025-05-07T20:32:42.6278828Z @given( 2025-05-07T20:32:42.6278942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6279036Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6279154Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6279272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6279381Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6279450Z ) 2025-05-07T20:32:42.6279693Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6279781Z def test_silu_mul_quant( 2025-05-07T20:32:42.6279861Z self, 2025-05-07T20:32:42.6279933Z T: int, 2025-05-07T20:32:42.6280005Z D: int, 2025-05-07T20:32:42.6280102Z scale_ub: Optional[float], 2025-05-07T20:32:42.6280184Z contiguous: bool, 2025-05-07T20:32:42.6280267Z compiled: bool, 2025-05-07T20:32:42.6280340Z ) -> None: 2025-05-07T20:32:42.6280429Z torch.manual_seed(2025) 2025-05-07T20:32:42.6280497Z 2025-05-07T20:32:42.6280661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6280731Z 2025-05-07T20:32:42.6280818Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6280944Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6281028Z x = x_sign * x_clamp 2025-05-07T20:32:42.6281105Z x0 = x[:, :D] 2025-05-07T20:32:42.6281182Z x1 = x[:, D:] 2025-05-07T20:32:42.6281249Z 2025-05-07T20:32:42.6281330Z if contiguous: 2025-05-07T20:32:42.6281414Z x0 = x0.contiguous() 2025-05-07T20:32:42.6281546Z x1 = x1.contiguous() 2025-05-07T20:32:42.6281612Z 2025-05-07T20:32:42.6281697Z if scale_ub is not None: 2025-05-07T20:32:42.6281800Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6281928Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6282062Z ) 2025-05-07T20:32:42.6282139Z else: 2025-05-07T20:32:42.6282228Z scale_ub_tensor = None 2025-05-07T20:32:42.6282297Z 2025-05-07T20:32:42.6282424Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6282508Z op = silu_mul_quant 2025-05-07T20:32:42.6282630Z if compiled: 2025-05-07T20:32:42.6282729Z op = torch.compile(op) 2025-05-07T20:32:42.6282830Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6282897Z 2025-05-07T20:32:42.6282983Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6282987Z 2025-05-07T20:32:42.6283078Z moe/activation_test.py:117: 2025-05-07T20:32:42.6283207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6283304Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6283397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6283763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6283849Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6284341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6284474Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6284831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6285056Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6285389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6285480Z kernel = self.compile( 2025-05-07T20:32:42.6285857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6286030Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6286154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6286158Z 2025-05-07T20:32:42.6286361Z self = 2025-05-07T20:32:42.6287188Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6287697Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7fc9550>} 2025-05-07T20:32:42.6288441Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6288632Z context = 2025-05-07T20:32:42.6288637Z 2025-05-07T20:32:42.6288798Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6289056Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6289166Z module_map=module_map) 2025-05-07T20:32:42.6289326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6289423Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6289491Z E ^ 2025-05-07T20:32:42.6289840Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6289888Z 2025-05-07T20:32:42.6290300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6290304Z 2025-05-07T20:32:42.6290400Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6290656Z self=, 2025-05-07T20:32:42.6290729Z T=16384, 2025-05-07T20:32:42.6290798Z D=5120, 2025-05-07T20:32:42.6290879Z scale_ub=None, 2025-05-07T20:32:42.6290962Z contiguous=False, 2025-05-07T20:32:42.6291041Z compiled=False, 2025-05-07T20:32:42.6291155Z ) 2025-05-07T20:32:42.6291373Z self = 2025-05-07T20:32:42.6291543Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6291547Z 2025-05-07T20:32:42.6291622Z @given( 2025-05-07T20:32:42.6291735Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6291835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6291946Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6292059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6292172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6292244Z ) 2025-05-07T20:32:42.6292485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6292575Z def test_silu_mul_quant( 2025-05-07T20:32:42.6292644Z self, 2025-05-07T20:32:42.6292715Z T: int, 2025-05-07T20:32:42.6292791Z D: int, 2025-05-07T20:32:42.6292924Z scale_ub: Optional[float], 2025-05-07T20:32:42.6293009Z contiguous: bool, 2025-05-07T20:32:42.6293092Z compiled: bool, 2025-05-07T20:32:42.6293164Z ) -> None: 2025-05-07T20:32:42.6293257Z torch.manual_seed(2025) 2025-05-07T20:32:42.6293325Z 2025-05-07T20:32:42.6293489Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6293566Z 2025-05-07T20:32:42.6293651Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6293768Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6295641Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
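Two failure modes alternate through this run, and both are visible above. Every CompilationError has the same root cause: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, which this runner's GPU does not implement; the error text says only fp8e4b15 and fp8e5 are available here. Below is a minimal sketch of a capability guard that skips the FP8 tests up front instead of failing in the Triton compiler; the (8, 9) compute-capability threshold, the helper name, and the class name are assumptions, not taken from this log.

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "GPU lacks fp8e4nv support")
    class SiluMulQuantTest(unittest.TestCase):
        ...  # test_silu_mul_quant as shown in the log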
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6295650Z 2025-05-07T20:32:42.6295764Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.6295768Z 2025-05-07T20:32:42.6295879Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6296136Z self=, 2025-05-07T20:32:42.6296217Z T=4096, 2025-05-07T20:32:42.6296287Z D=7168, 2025-05-07T20:32:42.6296363Z scale_ub=1200.0, 2025-05-07T20:32:42.6296445Z contiguous=True, 2025-05-07T20:32:42.6296525Z compiled=True, 2025-05-07T20:32:42.6296593Z ) 2025-05-07T20:32:42.6296805Z self = 2025-05-07T20:32:42.6296969Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6296974Z 2025-05-07T20:32:42.6297048Z @given( 2025-05-07T20:32:42.6297165Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6297259Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6297367Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6297481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6297589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6297707Z ) 2025-05-07T20:32:42.6297954Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6298043Z def test_silu_mul_quant( 2025-05-07T20:32:42.6298122Z self, 2025-05-07T20:32:42.6298230Z T: int, 2025-05-07T20:32:42.6298305Z D: int, 2025-05-07T20:32:42.6298399Z scale_ub: Optional[float], 2025-05-07T20:32:42.6298483Z contiguous: bool, 2025-05-07T20:32:42.6298561Z compiled: bool, 2025-05-07T20:32:42.6298637Z ) -> None: 2025-05-07T20:32:42.6298725Z torch.manual_seed(2025) 2025-05-07T20:32:42.6298835Z 2025-05-07T20:32:42.6298999Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6299065Z 2025-05-07T20:32:42.6299154Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6299272Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6301079Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6301091Z 2025-05-07T20:32:42.6301202Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.6301247Z 2025-05-07T20:32:42.6301345Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6301570Z self=, 2025-05-07T20:32:42.6301644Z T=16384, 2025-05-07T20:32:42.6301713Z D=7168, 2025-05-07T20:32:42.6301790Z scale_ub=None, 2025-05-07T20:32:42.6301872Z contiguous=False, 2025-05-07T20:32:42.6301952Z compiled=False, 2025-05-07T20:32:42.6302026Z ) 2025-05-07T20:32:42.6302234Z self = 2025-05-07T20:32:42.6302405Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6302413Z 2025-05-07T20:32:42.6302486Z @given( 2025-05-07T20:32:42.6302596Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6302693Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6302803Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6302919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6303035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6303105Z ) 2025-05-07T20:32:42.6303349Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6303437Z def test_silu_mul_quant( 2025-05-07T20:32:42.6303511Z self, 2025-05-07T20:32:42.6303591Z T: int, 2025-05-07T20:32:42.6303662Z D: int, 2025-05-07T20:32:42.6304051Z scale_ub: Optional[float], 2025-05-07T20:32:42.6304166Z contiguous: bool, 2025-05-07T20:32:42.6304249Z compiled: bool, 2025-05-07T20:32:42.6304320Z ) -> None: 2025-05-07T20:32:42.6304414Z torch.manual_seed(2025) 2025-05-07T20:32:42.6304482Z 2025-05-07T20:32:42.6304642Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6306485Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
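The sizes in these OOM reports line up with the failing statement: x = torch.randn([T, 2 * D], ...) in bfloat16 costs T * 2D * 2 bytes. A quick check against the example above (T=16384, D=7168):

    # 16384 rows * (2 * 7168) columns * 2 bytes (bfloat16) = 448 MiB exactly.
    T, D, bytes_per_elem = 16384, 7168, 2
    size_mib = T * (2 * D) * bytes_per_elem / 2**20
    assert size_mib == 448.0  # matches "Tried to allocate 448.00 MiB"

The individual tensors are therefore modest; each example fails because the process is already holding roughly 22 GiB of the card's 22.07 GiB when it runs.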
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6306583Z 2025-05-07T20:32:42.6306700Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6306704Z 2025-05-07T20:32:42.6306807Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6307085Z self=, 2025-05-07T20:32:42.6307166Z T=2048, 2025-05-07T20:32:42.6307235Z D=7168, 2025-05-07T20:32:42.6307310Z scale_ub=1200.0, 2025-05-07T20:32:42.6307391Z contiguous=True, 2025-05-07T20:32:42.6307466Z compiled=True, 2025-05-07T20:32:42.6307535Z ) 2025-05-07T20:32:42.6307819Z self = 2025-05-07T20:32:42.6307984Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6307989Z 2025-05-07T20:32:42.6308058Z @given( 2025-05-07T20:32:42.6308172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6308265Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6308379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6308489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6308595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6308667Z ) 2025-05-07T20:32:42.6308905Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6308993Z def test_silu_mul_quant( 2025-05-07T20:32:42.6309070Z self, 2025-05-07T20:32:42.6309145Z T: int, 2025-05-07T20:32:42.6309213Z D: int, 2025-05-07T20:32:42.6309370Z scale_ub: Optional[float], 2025-05-07T20:32:42.6309455Z contiguous: bool, 2025-05-07T20:32:42.6309536Z compiled: bool, 2025-05-07T20:32:42.6309611Z ) -> None: 2025-05-07T20:32:42.6309698Z torch.manual_seed(2025) 2025-05-07T20:32:42.6309769Z 2025-05-07T20:32:42.6309981Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6310051Z 2025-05-07T20:32:42.6310140Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6310258Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6312025Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6312039Z 2025-05-07T20:32:42.6312151Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.6312156Z 2025-05-07T20:32:42.6312254Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6312478Z self=, 2025-05-07T20:32:42.6312550Z T=2048, 2025-05-07T20:32:42.6312623Z D=7168, 2025-05-07T20:32:42.6312701Z scale_ub=None, 2025-05-07T20:32:42.6312779Z contiguous=True, 2025-05-07T20:32:42.6312858Z compiled=False, 2025-05-07T20:32:42.6312928Z ) 2025-05-07T20:32:42.6313135Z self = 2025-05-07T20:32:42.6313305Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6313310Z 2025-05-07T20:32:42.6313379Z @given( 2025-05-07T20:32:42.6313494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6313595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6313702Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6313813Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6313925Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6314065Z ) 2025-05-07T20:32:42.6314309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6314399Z def test_silu_mul_quant( 2025-05-07T20:32:42.6314469Z self, 2025-05-07T20:32:42.6314548Z T: int, 2025-05-07T20:32:42.6314661Z D: int, 2025-05-07T20:32:42.6314757Z scale_ub: Optional[float], 2025-05-07T20:32:42.6314844Z contiguous: bool, 2025-05-07T20:32:42.6314923Z compiled: bool, 2025-05-07T20:32:42.6314995Z ) -> None: 2025-05-07T20:32:42.6315092Z torch.manual_seed(2025) 2025-05-07T20:32:42.6315160Z 2025-05-07T20:32:42.6315365Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6315438Z 2025-05-07T20:32:42.6315526Z > x_sign = torch.sign(x) 2025-05-07T20:32:42.6317297Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6317305Z 2025-05-07T20:32:42.6317419Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:42.6317423Z 2025-05-07T20:32:42.6317528Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6317791Z self=, 2025-05-07T20:32:42.6317866Z T=1, 2025-05-07T20:32:42.6317935Z D=7168, 2025-05-07T20:32:42.6318012Z scale_ub=1200.0, 2025-05-07T20:32:42.6318091Z contiguous=True, 2025-05-07T20:32:42.6318171Z compiled=False, 2025-05-07T20:32:42.6318238Z ) 2025-05-07T20:32:42.6318447Z self = 2025-05-07T20:32:42.6318613Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6318618Z 2025-05-07T20:32:42.6318691Z @given( 2025-05-07T20:32:42.6318809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6318901Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6319010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6319124Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6319231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6319302Z ) 2025-05-07T20:32:42.6319553Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6319644Z def test_silu_mul_quant( 2025-05-07T20:32:42.6319716Z self, 2025-05-07T20:32:42.6319793Z T: int, 2025-05-07T20:32:42.6319861Z D: int, 2025-05-07T20:32:42.6319955Z scale_ub: Optional[float], 2025-05-07T20:32:42.6320040Z contiguous: bool, 2025-05-07T20:32:42.6320119Z compiled: bool, 2025-05-07T20:32:42.6320197Z ) -> None: 2025-05-07T20:32:42.6320287Z torch.manual_seed(2025) 2025-05-07T20:32:42.6320356Z 2025-05-07T20:32:42.6320523Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6320591Z 2025-05-07T20:32:42.6320676Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6320798Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6320881Z x = x_sign * x_clamp 2025-05-07T20:32:42.6320954Z x0 = x[:, :D] 2025-05-07T20:32:42.6321039Z x1 = x[:, D:] 2025-05-07T20:32:42.6321106Z 2025-05-07T20:32:42.6321188Z if contiguous: 2025-05-07T20:32:42.6321274Z x0 = x0.contiguous() 2025-05-07T20:32:42.6321356Z x1 = x1.contiguous() 2025-05-07T20:32:42.6321422Z 2025-05-07T20:32:42.6321508Z if scale_ub is not None: 2025-05-07T20:32:42.6321609Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6321790Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6321860Z ) 2025-05-07T20:32:42.6321934Z else: 2025-05-07T20:32:42.6322027Z scale_ub_tensor = None 2025-05-07T20:32:42.6322133Z 2025-05-07T20:32:42.6322261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6322350Z op = silu_mul_quant 2025-05-07T20:32:42.6322431Z if compiled: 2025-05-07T20:32:42.6322529Z op = torch.compile(op) 2025-05-07T20:32:42.6322635Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6322744Z 2025-05-07T20:32:42.6322834Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6322838Z 2025-05-07T20:32:42.6322928Z moe/activation_test.py:117: 2025-05-07T20:32:42.6323051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6323152Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6323249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6323747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6323840Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6324195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6324419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6324792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6324884Z kernel = self.compile( 2025-05-07T20:32:42.6325264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6325433Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6325557Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6325566Z 2025-05-07T20:32:42.6325769Z self = 2025-05-07T20:32:42.6326549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6327053Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc818f040>} 2025-05-07T20:32:42.6327795Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6327983Z context = 2025-05-07T20:32:42.6327990Z 2025-05-07T20:32:42.6328149Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6328408Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6328515Z module_map=module_map) 2025-05-07T20:32:42.6328674Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6328771Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6328842Z E ^ 2025-05-07T20:32:42.6329197Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6329204Z 2025-05-07T20:32:42.6329611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6329616Z 2025-05-07T20:32:42.6329712Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6329932Z self=, 2025-05-07T20:32:42.6330046Z T=128, 2025-05-07T20:32:42.6330116Z D=5120, 2025-05-07T20:32:42.6330196Z scale_ub=None, 2025-05-07T20:32:42.6330274Z contiguous=True, 2025-05-07T20:32:42.6330351Z compiled=False, 2025-05-07T20:32:42.6330422Z ) 2025-05-07T20:32:42.6330678Z self = 2025-05-07T20:32:42.6330846Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6330850Z 2025-05-07T20:32:42.6330925Z @given( 2025-05-07T20:32:42.6331037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6331176Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6331289Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6331400Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6331509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6331576Z ) 2025-05-07T20:32:42.6331823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6331917Z def test_silu_mul_quant( 2025-05-07T20:32:42.6331984Z self, 2025-05-07T20:32:42.6332055Z T: int, 2025-05-07T20:32:42.6332127Z D: int, 2025-05-07T20:32:42.6332223Z scale_ub: Optional[float], 2025-05-07T20:32:42.6332305Z contiguous: bool, 2025-05-07T20:32:42.6332389Z compiled: bool, 2025-05-07T20:32:42.6332460Z ) -> None: 2025-05-07T20:32:42.6332549Z torch.manual_seed(2025) 2025-05-07T20:32:42.6332617Z 2025-05-07T20:32:42.6332819Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6332894Z 2025-05-07T20:32:42.6332980Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6333099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6333183Z x = x_sign * x_clamp 2025-05-07T20:32:42.6333258Z x0 = x[:, :D] 2025-05-07T20:32:42.6333333Z x1 = x[:, D:] 2025-05-07T20:32:42.6333402Z 2025-05-07T20:32:42.6333480Z if contiguous: 2025-05-07T20:32:42.6333563Z x0 = x0.contiguous() 2025-05-07T20:32:42.6333648Z x1 = x1.contiguous() 2025-05-07T20:32:42.6333715Z 2025-05-07T20:32:42.6333799Z if scale_ub is not None: 2025-05-07T20:32:42.6333908Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6334037Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6334114Z ) 2025-05-07T20:32:42.6334187Z else: 2025-05-07T20:32:42.6334276Z scale_ub_tensor = None 2025-05-07T20:32:42.6334352Z 2025-05-07T20:32:42.6334475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6334561Z op = silu_mul_quant 2025-05-07T20:32:42.6334644Z if compiled: 2025-05-07T20:32:42.6334738Z op = torch.compile(op) 2025-05-07T20:32:42.6334839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6334912Z 2025-05-07T20:32:42.6334998Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6335003Z 2025-05-07T20:32:42.6335097Z moe/activation_test.py:117: 2025-05-07T20:32:42.6335225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6335323Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6335419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6335920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6336014Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6336375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6336591Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6336930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6337064Z kernel = self.compile( 2025-05-07T20:32:42.6337438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6337610Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6337771Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6337776Z 2025-05-07T20:32:42.6337980Z self = 2025-05-07T20:32:42.6338759Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6339302Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc818fa60>} 2025-05-07T20:32:42.6340046Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6340235Z context = 2025-05-07T20:32:42.6340240Z 2025-05-07T20:32:42.6340404Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6340661Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6340821Z module_map=module_map) 2025-05-07T20:32:42.6340982Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6341077Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6341150Z E ^ 2025-05-07T20:32:42.6341510Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6341519Z 2025-05-07T20:32:42.6341925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6341930Z 2025-05-07T20:32:42.6342027Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6342248Z self=, 2025-05-07T20:32:42.6342320Z T=128, 2025-05-07T20:32:42.6342390Z D=7168, 2025-05-07T20:32:42.6342463Z scale_ub=None, 2025-05-07T20:32:42.6342541Z contiguous=True, 2025-05-07T20:32:42.6342623Z compiled=False, 2025-05-07T20:32:42.6348162Z ) 2025-05-07T20:32:42.6348417Z self = 2025-05-07T20:32:42.6348593Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6348599Z 2025-05-07T20:32:42.6348673Z @given( 2025-05-07T20:32:42.6348789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6348884Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6348997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6349110Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6349219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6349290Z ) 2025-05-07T20:32:42.6349539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6349629Z def test_silu_mul_quant( 2025-05-07T20:32:42.6349701Z self, 2025-05-07T20:32:42.6349775Z T: int, 2025-05-07T20:32:42.6349901Z D: int, 2025-05-07T20:32:42.6350003Z scale_ub: Optional[float], 2025-05-07T20:32:42.6350092Z contiguous: bool, 2025-05-07T20:32:42.6350171Z compiled: bool, 2025-05-07T20:32:42.6350242Z ) -> None: 2025-05-07T20:32:42.6350336Z torch.manual_seed(2025) 2025-05-07T20:32:42.6350404Z 2025-05-07T20:32:42.6350571Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6350711Z 2025-05-07T20:32:42.6350798Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6350922Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6351005Z x = x_sign * x_clamp 2025-05-07T20:32:42.6351079Z x0 = x[:, :D] 2025-05-07T20:32:42.6351196Z x1 = x[:, D:] 2025-05-07T20:32:42.6351261Z 2025-05-07T20:32:42.6351343Z if contiguous: 2025-05-07T20:32:42.6351434Z x0 = x0.contiguous() 2025-05-07T20:32:42.6351522Z x1 = x1.contiguous() 2025-05-07T20:32:42.6351589Z 2025-05-07T20:32:42.6351680Z if scale_ub is not None: 2025-05-07T20:32:42.6351825Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6351964Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6352038Z ) 2025-05-07T20:32:42.6352112Z else: 2025-05-07T20:32:42.6352204Z scale_ub_tensor = None 2025-05-07T20:32:42.6352268Z 2025-05-07T20:32:42.6352394Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6352486Z op = silu_mul_quant 2025-05-07T20:32:42.6352567Z if compiled: 2025-05-07T20:32:42.6352663Z op = torch.compile(op) 2025-05-07T20:32:42.6352770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6352838Z 2025-05-07T20:32:42.6352924Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6352929Z 2025-05-07T20:32:42.6353024Z moe/activation_test.py:117: 2025-05-07T20:32:42.6353150Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6353292Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6353390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6353894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6353991Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6354349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6354573Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6354915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6355004Z kernel = self.compile( 2025-05-07T20:32:42.6355383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6355553Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6355682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6355687Z 2025-05-07T20:32:42.6355892Z self = 2025-05-07T20:32:42.6356672Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6357188Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7d85790>} 2025-05-07T20:32:42.6357931Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6358123Z context = 2025-05-07T20:32:42.6358133Z 2025-05-07T20:32:42.6358295Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6358554Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6358659Z module_map=module_map) 2025-05-07T20:32:42.6358860Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6358955Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6359031Z E ^ 2025-05-07T20:32:42.6359417Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6359423Z 2025-05-07T20:32:42.6359834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6359839Z 2025-05-07T20:32:42.6359936Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6360155Z self=, 2025-05-07T20:32:42.6360271Z T=2048, 2025-05-07T20:32:42.6360340Z D=7168, 2025-05-07T20:32:42.6360416Z scale_ub=1200.0, 2025-05-07T20:32:42.6360498Z contiguous=True, 2025-05-07T20:32:42.6360578Z compiled=False, 2025-05-07T20:32:42.6360647Z ) 2025-05-07T20:32:42.6360862Z self = 2025-05-07T20:32:42.6361033Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6361038Z 2025-05-07T20:32:42.6361109Z @given( 2025-05-07T20:32:42.6361222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6361319Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6361433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6361546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6361655Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6361730Z ) 2025-05-07T20:32:42.6362012Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6362102Z def test_silu_mul_quant( 2025-05-07T20:32:42.6362175Z self, 2025-05-07T20:32:42.6362244Z T: int, 2025-05-07T20:32:42.6362313Z D: int, 2025-05-07T20:32:42.6362410Z scale_ub: Optional[float], 2025-05-07T20:32:42.6362498Z contiguous: bool, 2025-05-07T20:32:42.6362579Z compiled: bool, 2025-05-07T20:32:42.6362655Z ) -> None: 2025-05-07T20:32:42.6362745Z torch.manual_seed(2025) 2025-05-07T20:32:42.6362814Z 2025-05-07T20:32:42.6362977Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6364764Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6364785Z 2025-05-07T20:32:42.6364898Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6364905Z 2025-05-07T20:32:42.6365003Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6365228Z self=, 2025-05-07T20:32:42.6365298Z T=1, 2025-05-07T20:32:42.6365367Z D=5120, 2025-05-07T20:32:42.6365450Z scale_ub=1200.0, 2025-05-07T20:32:42.6365532Z contiguous=True, 2025-05-07T20:32:42.6365611Z compiled=False, 2025-05-07T20:32:42.6365685Z ) 2025-05-07T20:32:42.6365894Z self = 2025-05-07T20:32:42.6366077Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6366086Z 2025-05-07T20:32:42.6366164Z @given( 2025-05-07T20:32:42.6366300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6366399Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6366507Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6366668Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6366779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6366847Z ) 2025-05-07T20:32:42.6367086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6367218Z def test_silu_mul_quant( 2025-05-07T20:32:42.6367293Z self, 2025-05-07T20:32:42.6367367Z T: int, 2025-05-07T20:32:42.6367438Z D: int, 2025-05-07T20:32:42.6367529Z scale_ub: Optional[float], 2025-05-07T20:32:42.6367618Z contiguous: bool, 2025-05-07T20:32:42.6367697Z compiled: bool, 2025-05-07T20:32:42.6367808Z ) -> None: 2025-05-07T20:32:42.6367898Z torch.manual_seed(2025) 2025-05-07T20:32:42.6367966Z 2025-05-07T20:32:42.6368125Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6368195Z 2025-05-07T20:32:42.6368284Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6368400Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6368491Z x = x_sign * x_clamp 2025-05-07T20:32:42.6368565Z x0 = x[:, :D] 2025-05-07T20:32:42.6368651Z x1 = x[:, D:] 2025-05-07T20:32:42.6368719Z 2025-05-07T20:32:42.6368796Z if contiguous: 2025-05-07T20:32:42.6368887Z x0 = x0.contiguous() 2025-05-07T20:32:42.6368969Z x1 = x1.contiguous() 2025-05-07T20:32:42.6369037Z 2025-05-07T20:32:42.6369126Z if scale_ub is not None: 2025-05-07T20:32:42.6369226Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6369400Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6369479Z ) 2025-05-07T20:32:42.6369550Z else: 2025-05-07T20:32:42.6369641Z scale_ub_tensor = None 2025-05-07T20:32:42.6369708Z 2025-05-07T20:32:42.6369835Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6369919Z op = silu_mul_quant 2025-05-07T20:32:42.6370006Z if compiled: 2025-05-07T20:32:42.6370101Z op = torch.compile(op) 2025-05-07T20:32:42.6370206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6370272Z 2025-05-07T20:32:42.6370358Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6370362Z 2025-05-07T20:32:42.6370459Z moe/activation_test.py:117: 2025-05-07T20:32:42.6370583Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6370679Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6370774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6371276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6371377Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6371732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6371956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6372296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6372387Z kernel = self.compile( 2025-05-07T20:32:42.6372763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6372936Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6373056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6373063Z 2025-05-07T20:32:42.6373271Z self = 2025-05-07T20:32:42.6374049Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6374601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7ea3040>} 2025-05-07T20:32:42.6375381Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6375568Z context = 2025-05-07T20:32:42.6375573Z 2025-05-07T20:32:42.6375739Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6376061Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6376164Z module_map=module_map) 2025-05-07T20:32:42.6376321Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6376416Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6376494Z E ^ 2025-05-07T20:32:42.6376846Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6376851Z 2025-05-07T20:32:42.6377261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6377265Z 2025-05-07T20:32:42.6377364Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6377583Z self=, 2025-05-07T20:32:42.6377660Z T=2048, 2025-05-07T20:32:42.6377733Z D=5120, 2025-05-07T20:32:42.6377850Z scale_ub=None, 2025-05-07T20:32:42.6377940Z contiguous=True, 2025-05-07T20:32:42.6378019Z compiled=False, 2025-05-07T20:32:42.6378087Z ) 2025-05-07T20:32:42.6378301Z self = 2025-05-07T20:32:42.6378470Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6378479Z 2025-05-07T20:32:42.6378555Z @given( 2025-05-07T20:32:42.6378670Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6378765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6378882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6378992Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6379101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6379172Z ) 2025-05-07T20:32:42.6379412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6379512Z def test_silu_mul_quant( 2025-05-07T20:32:42.6379584Z self, 2025-05-07T20:32:42.6379653Z T: int, 2025-05-07T20:32:42.6379725Z D: int, 2025-05-07T20:32:42.6379820Z scale_ub: Optional[float], 2025-05-07T20:32:42.6379904Z contiguous: bool, 2025-05-07T20:32:42.6379983Z compiled: bool, 2025-05-07T20:32:42.6380061Z ) -> None: 2025-05-07T20:32:42.6380153Z torch.manual_seed(2025) 2025-05-07T20:32:42.6380224Z 2025-05-07T20:32:42.6380386Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6380453Z 2025-05-07T20:32:42.6380540Z > x_sign = torch.sign(x) 2025-05-07T20:32:42.6382322Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6382330Z 2025-05-07T20:32:42.6382447Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:42.6382496Z 2025-05-07T20:32:42.6382596Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6382814Z self=, 2025-05-07T20:32:42.6382892Z T=16384, 2025-05-07T20:32:42.6382964Z D=5120, 2025-05-07T20:32:42.6383080Z scale_ub=None, 2025-05-07T20:32:42.6383166Z contiguous=True, 2025-05-07T20:32:42.6383243Z compiled=False, 2025-05-07T20:32:42.6383315Z ) 2025-05-07T20:32:42.6383531Z self = 2025-05-07T20:32:42.6383702Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6383748Z 2025-05-07T20:32:42.6383820Z @given( 2025-05-07T20:32:42.6383933Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6384025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6384140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6384252Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6384364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6384435Z ) 2025-05-07T20:32:42.6384676Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6384766Z def test_silu_mul_quant( 2025-05-07T20:32:42.6384838Z self, 2025-05-07T20:32:42.6384908Z T: int, 2025-05-07T20:32:42.6384983Z D: int, 2025-05-07T20:32:42.6385076Z scale_ub: Optional[float], 2025-05-07T20:32:42.6385160Z contiguous: bool, 2025-05-07T20:32:42.6385243Z compiled: bool, 2025-05-07T20:32:42.6385317Z ) -> None: 2025-05-07T20:32:42.6385449Z torch.manual_seed(2025) 2025-05-07T20:32:42.6385523Z 2025-05-07T20:32:42.6385687Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6387471Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
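To rerun one of these parameterizations on its own, the failing arguments can be pinned with hypothesis.example, which executes the pinned case on every run in addition to the generated ones. A sketch using the same strategies as the test above, written as a free function for brevity and with the body elided:

    from typing import Optional

    from hypothesis import example, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
    @settings(deadline=None)
    def test_silu_mul_quant(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # body as in the log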
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6387479Z 2025-05-07T20:32:42.6387592Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6387597Z 2025-05-07T20:32:42.6387696Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6387922Z self=, 2025-05-07T20:32:42.6387995Z T=4096, 2025-05-07T20:32:42.6388066Z D=5120, 2025-05-07T20:32:42.6388141Z scale_ub=None, 2025-05-07T20:32:42.6388220Z contiguous=True, 2025-05-07T20:32:42.6388303Z compiled=False, 2025-05-07T20:32:42.6388368Z ) 2025-05-07T20:32:42.6388580Z self = 2025-05-07T20:32:42.6388750Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6388755Z 2025-05-07T20:32:42.6388825Z @given( 2025-05-07T20:32:42.6388942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6389033Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6389142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6389254Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6389361Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6389432Z ) 2025-05-07T20:32:42.6389673Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6389761Z def test_silu_mul_quant( 2025-05-07T20:32:42.6389910Z self, 2025-05-07T20:32:42.6389985Z T: int, 2025-05-07T20:32:42.6390058Z D: int, 2025-05-07T20:32:42.6390150Z scale_ub: Optional[float], 2025-05-07T20:32:42.6390285Z contiguous: bool, 2025-05-07T20:32:42.6390365Z compiled: bool, 2025-05-07T20:32:42.6390439Z ) -> None: 2025-05-07T20:32:42.6390529Z torch.manual_seed(2025) 2025-05-07T20:32:42.6390598Z 2025-05-07T20:32:42.6390805Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6392584Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6392629Z 2025-05-07T20:32:42.6392749Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6392753Z 2025-05-07T20:32:42.6392850Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6393071Z self=, 2025-05-07T20:32:42.6393146Z T=2048, 2025-05-07T20:32:42.6393218Z D=5120, 2025-05-07T20:32:42.6393296Z scale_ub=None, 2025-05-07T20:32:42.6393379Z contiguous=False, 2025-05-07T20:32:42.6393456Z compiled=False, 2025-05-07T20:32:42.6393528Z ) 2025-05-07T20:32:42.6393738Z self = 2025-05-07T20:32:42.6393947Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6393952Z 2025-05-07T20:32:42.6394030Z @given( 2025-05-07T20:32:42.6394141Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6394235Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6394347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6394461Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6394568Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6394639Z ) 2025-05-07T20:32:42.6394881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6394975Z def test_silu_mul_quant( 2025-05-07T20:32:42.6395042Z self, 2025-05-07T20:32:42.6395114Z T: int, 2025-05-07T20:32:42.6395194Z D: int, 2025-05-07T20:32:42.6395287Z scale_ub: Optional[float], 2025-05-07T20:32:42.6395371Z contiguous: bool, 2025-05-07T20:32:42.6395459Z compiled: bool, 2025-05-07T20:32:42.6395532Z ) -> None: 2025-05-07T20:32:42.6395621Z torch.manual_seed(2025) 2025-05-07T20:32:42.6395689Z 2025-05-07T20:32:42.6395849Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6397624Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6397632Z 2025-05-07T20:32:42.6397744Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6397751Z 2025-05-07T20:32:42.6397854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6398072Z self=, 2025-05-07T20:32:42.6398141Z T=4096, 2025-05-07T20:32:42.6398216Z D=7168, 2025-05-07T20:32:42.6398292Z scale_ub=None, 2025-05-07T20:32:42.6398369Z contiguous=True, 2025-05-07T20:32:42.6398450Z compiled=True, 2025-05-07T20:32:42.6398564Z ) 2025-05-07T20:32:42.6398781Z self = 2025-05-07T20:32:42.6398947Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.6398951Z 2025-05-07T20:32:42.6399059Z @given( 2025-05-07T20:32:42.6399176Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6399270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6399379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6399490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6399642Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6399713Z ) 2025-05-07T20:32:42.6399954Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6400044Z def test_silu_mul_quant( 2025-05-07T20:32:42.6400118Z self, 2025-05-07T20:32:42.6400190Z T: int, 2025-05-07T20:32:42.6400262Z D: int, 2025-05-07T20:32:42.6400360Z scale_ub: Optional[float], 2025-05-07T20:32:42.6400446Z contiguous: bool, 2025-05-07T20:32:42.6400526Z compiled: bool, 2025-05-07T20:32:42.6400600Z ) -> None: 2025-05-07T20:32:42.6400691Z torch.manual_seed(2025) 2025-05-07T20:32:42.6400756Z 2025-05-07T20:32:42.6400921Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6402735Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6402752Z 2025-05-07T20:32:42.6402866Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6402871Z 2025-05-07T20:32:42.6402965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6403193Z self=, 2025-05-07T20:32:42.6403264Z T=2048, 2025-05-07T20:32:42.6403337Z D=5120, 2025-05-07T20:32:42.6403416Z scale_ub=1200.0, 2025-05-07T20:32:42.6403496Z contiguous=False, 2025-05-07T20:32:42.6403583Z compiled=False, 2025-05-07T20:32:42.6403651Z ) 2025-05-07T20:32:42.6404243Z self = 2025-05-07T20:32:42.6404423Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6404428Z 2025-05-07T20:32:42.6404499Z @given( 2025-05-07T20:32:42.6404611Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6404709Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6404821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6404930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6405039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6405108Z ) 2025-05-07T20:32:42.6405360Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6405449Z def test_silu_mul_quant( 2025-05-07T20:32:42.6405521Z self, 2025-05-07T20:32:42.6405595Z T: int, 2025-05-07T20:32:42.6405666Z D: int, 2025-05-07T20:32:42.6405759Z scale_ub: Optional[float], 2025-05-07T20:32:42.6405853Z contiguous: bool, 2025-05-07T20:32:42.6405933Z compiled: bool, 2025-05-07T20:32:42.6406005Z ) -> None: 2025-05-07T20:32:42.6406097Z torch.manual_seed(2025) 2025-05-07T20:32:42.6406166Z 2025-05-07T20:32:42.6406338Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6408325Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
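Note: the free-memory figure stays pinned around 26 MiB while Hypothesis keeps trying new examples, which suggests earlier examples (or an earlier test) left the caching allocator full. One plausible mitigation, sketched here with standard PyTorch APIs and not taken from the test file itself, is to release cached blocks between examples:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references first, then return cached
        # allocator blocks to the driver.
        gc.collect()
        torch.cuda.synchronize()
        torch.cuda.empty_cache()

Calling this at the top of the test body (or in setUp/tearDown) gives each generated example a clean allocator state, at the cost of slower allocations.
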
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6408331Z 2025-05-07T20:32:42.6408502Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6408510Z 2025-05-07T20:32:42.6408606Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6408828Z self=, 2025-05-07T20:32:42.6408904Z T=4096, 2025-05-07T20:32:42.6408975Z D=7168, 2025-05-07T20:32:42.6409052Z scale_ub=1200.0, 2025-05-07T20:32:42.6409139Z contiguous=True, 2025-05-07T20:32:42.6409221Z compiled=False, 2025-05-07T20:32:42.6409291Z ) 2025-05-07T20:32:42.6409510Z self = 2025-05-07T20:32:42.6409677Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6409682Z 2025-05-07T20:32:42.6409758Z @given( 2025-05-07T20:32:42.6409869Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6409960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6410072Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6410241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6410352Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6410421Z ) 2025-05-07T20:32:42.6410661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6410749Z def test_silu_mul_quant( 2025-05-07T20:32:42.6410828Z self, 2025-05-07T20:32:42.6410900Z T: int, 2025-05-07T20:32:42.6410969Z D: int, 2025-05-07T20:32:42.6411066Z scale_ub: Optional[float], 2025-05-07T20:32:42.6411148Z contiguous: bool, 2025-05-07T20:32:42.6411235Z compiled: bool, 2025-05-07T20:32:42.6411309Z ) -> None: 2025-05-07T20:32:42.6411399Z torch.manual_seed(2025) 2025-05-07T20:32:42.6411467Z 2025-05-07T20:32:42.6411629Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6413416Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6413430Z 2025-05-07T20:32:42.6413541Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6413545Z 2025-05-07T20:32:42.6413643Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6413867Z self=, 2025-05-07T20:32:42.6413939Z T=16384, 2025-05-07T20:32:42.6414010Z D=7168, 2025-05-07T20:32:42.6414090Z scale_ub=None, 2025-05-07T20:32:42.6414170Z contiguous=False, 2025-05-07T20:32:42.6414251Z compiled=True, 2025-05-07T20:32:42.6414322Z ) 2025-05-07T20:32:42.6414530Z self = 2025-05-07T20:32:42.6414700Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6414705Z 2025-05-07T20:32:42.6414775Z @given( 2025-05-07T20:32:42.6414885Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6415027Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6415135Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6415247Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6415398Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6415467Z ) 2025-05-07T20:32:42.6415710Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6415800Z def test_silu_mul_quant( 2025-05-07T20:32:42.6415872Z self, 2025-05-07T20:32:42.6415948Z T: int, 2025-05-07T20:32:42.6416058Z D: int, 2025-05-07T20:32:42.6416150Z scale_ub: Optional[float], 2025-05-07T20:32:42.6416239Z contiguous: bool, 2025-05-07T20:32:42.6416318Z compiled: bool, 2025-05-07T20:32:42.6416388Z ) -> None: 2025-05-07T20:32:42.6416483Z torch.manual_seed(2025) 2025-05-07T20:32:42.6416551Z 2025-05-07T20:32:42.6416713Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6418543Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6418551Z 2025-05-07T20:32:42.6418662Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6418671Z 2025-05-07T20:32:42.6418767Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6418989Z self=, 2025-05-07T20:32:42.6419067Z T=4096, 2025-05-07T20:32:42.6419143Z D=7168, 2025-05-07T20:32:42.6419223Z scale_ub=None, 2025-05-07T20:32:42.6419307Z contiguous=True, 2025-05-07T20:32:42.6419387Z compiled=False, 2025-05-07T20:32:42.6419452Z ) 2025-05-07T20:32:42.6419675Z self = 2025-05-07T20:32:42.6419841Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6419845Z 2025-05-07T20:32:42.6419921Z @given( 2025-05-07T20:32:42.6420034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6420128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6420247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6420358Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6420466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6420541Z ) 2025-05-07T20:32:42.6420781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6420871Z def test_silu_mul_quant( 2025-05-07T20:32:42.6420945Z self, 2025-05-07T20:32:42.6421015Z T: int, 2025-05-07T20:32:42.6421082Z D: int, 2025-05-07T20:32:42.6421177Z scale_ub: Optional[float], 2025-05-07T20:32:42.6421262Z contiguous: bool, 2025-05-07T20:32:42.6421348Z compiled: bool, 2025-05-07T20:32:42.6421419Z ) -> None: 2025-05-07T20:32:42.6421511Z torch.manual_seed(2025) 2025-05-07T20:32:42.6421581Z 2025-05-07T20:32:42.6421742Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6423527Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6423579Z 2025-05-07T20:32:42.6423693Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6423734Z 2025-05-07T20:32:42.6423834Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6424057Z self=, 2025-05-07T20:32:42.6424130Z T=16384, 2025-05-07T20:32:42.6424204Z D=7168, 2025-05-07T20:32:42.6424283Z scale_ub=None, 2025-05-07T20:32:42.6424400Z contiguous=True, 2025-05-07T20:32:42.6424483Z compiled=False, 2025-05-07T20:32:42.6424553Z ) 2025-05-07T20:32:42.6424761Z self = 2025-05-07T20:32:42.6424936Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6424941Z 2025-05-07T20:32:42.6425017Z @given( 2025-05-07T20:32:42.6425128Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6425223Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6425333Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6425447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6425557Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6425628Z ) 2025-05-07T20:32:42.6425873Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6425959Z def test_silu_mul_quant( 2025-05-07T20:32:42.6426093Z self, 2025-05-07T20:32:42.6426175Z T: int, 2025-05-07T20:32:42.6426265Z D: int, 2025-05-07T20:32:42.6426361Z scale_ub: Optional[float], 2025-05-07T20:32:42.6426445Z contiguous: bool, 2025-05-07T20:32:42.6426526Z compiled: bool, 2025-05-07T20:32:42.6426599Z ) -> None: 2025-05-07T20:32:42.6426692Z torch.manual_seed(2025) 2025-05-07T20:32:42.6426758Z 2025-05-07T20:32:42.6426920Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6428707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6428715Z 2025-05-07T20:32:42.6428829Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6428838Z 2025-05-07T20:32:42.6428935Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6429154Z self=, 2025-05-07T20:32:42.6429233Z T=16384, 2025-05-07T20:32:42.6429302Z D=7168, 2025-05-07T20:32:42.6429378Z scale_ub=1200.0, 2025-05-07T20:32:42.6429459Z contiguous=True, 2025-05-07T20:32:42.6429536Z compiled=False, 2025-05-07T20:32:42.6429604Z ) 2025-05-07T20:32:42.6429872Z self = 2025-05-07T20:32:42.6430042Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6430046Z 2025-05-07T20:32:42.6430122Z @given( 2025-05-07T20:32:42.6430235Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6430332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6430445Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6430556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6430663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6430734Z ) 2025-05-07T20:32:42.6431023Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6431112Z def test_silu_mul_quant( 2025-05-07T20:32:42.6431188Z self, 2025-05-07T20:32:42.6431258Z T: int, 2025-05-07T20:32:42.6431334Z D: int, 2025-05-07T20:32:42.6431470Z scale_ub: Optional[float], 2025-05-07T20:32:42.6431555Z contiguous: bool, 2025-05-07T20:32:42.6431638Z compiled: bool, 2025-05-07T20:32:42.6431708Z ) -> None: 2025-05-07T20:32:42.6431797Z torch.manual_seed(2025) 2025-05-07T20:32:42.6431868Z 2025-05-07T20:32:42.6432031Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6433854Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
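Note: the verbose "Trying example: ..." blocks come from Verbosity.verbose in the @settings decorator, and the session banner later in this log shows the active profile ('ci' with derandomize=True, deadline=None). A profile like that is registered through Hypothesis's standard API; this sketch is reconstructed from the banner, not from the suite's actual conftest:

    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,
        deadline=None,
        print_blob=True,
        derandomize=True,
        suppress_health_check=(HealthCheck.too_slow,),
    )
    settings.load_profile("ci")
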
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6433865Z 2025-05-07T20:32:42.6433980Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6433985Z 2025-05-07T20:32:42.6434083Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6434304Z self=, 2025-05-07T20:32:42.6434375Z T=128, 2025-05-07T20:32:42.6434488Z D=5120, 2025-05-07T20:32:42.6434569Z scale_ub=1200.0, 2025-05-07T20:32:42.6434646Z contiguous=False, 2025-05-07T20:32:42.6434726Z compiled=False, 2025-05-07T20:32:42.6434795Z ) 2025-05-07T20:32:42.6435010Z self = 2025-05-07T20:32:42.6435176Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6435183Z 2025-05-07T20:32:42.6435256Z @given( 2025-05-07T20:32:42.6435367Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6435460Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6435571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6435680Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6435793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6435863Z ) 2025-05-07T20:32:42.6436106Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6436202Z def test_silu_mul_quant( 2025-05-07T20:32:42.6436271Z self, 2025-05-07T20:32:42.6436345Z T: int, 2025-05-07T20:32:42.6436416Z D: int, 2025-05-07T20:32:42.6436507Z scale_ub: Optional[float], 2025-05-07T20:32:42.6436592Z contiguous: bool, 2025-05-07T20:32:42.6436671Z compiled: bool, 2025-05-07T20:32:42.6436743Z ) -> None: 2025-05-07T20:32:42.6436834Z torch.manual_seed(2025) 2025-05-07T20:32:42.6436903Z 2025-05-07T20:32:42.6437065Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6437139Z 2025-05-07T20:32:42.6437230Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6437352Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6437436Z x = x_sign * x_clamp 2025-05-07T20:32:42.6437512Z x0 = x[:, :D] 2025-05-07T20:32:42.6437589Z x1 = x[:, D:] 2025-05-07T20:32:42.6437658Z 2025-05-07T20:32:42.6437735Z if contiguous: 2025-05-07T20:32:42.6437834Z x0 = x0.contiguous() 2025-05-07T20:32:42.6437921Z x1 = x1.contiguous() 2025-05-07T20:32:42.6437987Z 2025-05-07T20:32:42.6438081Z if scale_ub is not None: 2025-05-07T20:32:42.6438182Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6438313Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6438458Z ) 2025-05-07T20:32:42.6438530Z else: 2025-05-07T20:32:42.6438623Z scale_ub_tensor = None 2025-05-07T20:32:42.6438691Z 2025-05-07T20:32:42.6438815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6438944Z op = silu_mul_quant 2025-05-07T20:32:42.6439024Z if compiled: 2025-05-07T20:32:42.6439120Z op = torch.compile(op) 2025-05-07T20:32:42.6439226Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6439295Z 2025-05-07T20:32:42.6439381Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6439426Z 2025-05-07T20:32:42.6439522Z moe/activation_test.py:117: 2025-05-07T20:32:42.6439645Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6439741Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6439838Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6440335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6440432Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6440786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6441008Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6441346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6441434Z kernel = self.compile( 2025-05-07T20:32:42.6441854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6442023Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6442144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6442148Z 2025-05-07T20:32:42.6442355Z self = 2025-05-07T20:32:42.6443138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6443648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7bc5ca0>} 2025-05-07T20:32:42.6444394Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6444588Z context = 2025-05-07T20:32:42.6444593Z 2025-05-07T20:32:42.6444755Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6445018Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6445122Z module_map=module_map) 2025-05-07T20:32:42.6445279Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6445375Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6445453Z E ^ 2025-05-07T20:32:42.6445818Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6445824Z 2025-05-07T20:32:42.6446272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6446280Z 2025-05-07T20:32:42.6446376Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6446592Z self=, 2025-05-07T20:32:42.6446671Z T=2048, 2025-05-07T20:32:42.6446738Z D=7168, 2025-05-07T20:32:42.6446858Z scale_ub=None, 2025-05-07T20:32:42.6446945Z contiguous=False, 2025-05-07T20:32:42.6447026Z compiled=False, 2025-05-07T20:32:42.6447092Z ) 2025-05-07T20:32:42.6447306Z self = 2025-05-07T20:32:42.6447510Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6447515Z 2025-05-07T20:32:42.6447588Z @given( 2025-05-07T20:32:42.6447705Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6447797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6447913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6448068Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6448178Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6448250Z ) 2025-05-07T20:32:42.6448492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6448580Z def test_silu_mul_quant( 2025-05-07T20:32:42.6448660Z self, 2025-05-07T20:32:42.6448732Z T: int, 2025-05-07T20:32:42.6448805Z D: int, 2025-05-07T20:32:42.6448899Z scale_ub: Optional[float], 2025-05-07T20:32:42.6448984Z contiguous: bool, 2025-05-07T20:32:42.6449067Z compiled: bool, 2025-05-07T20:32:42.6449138Z ) -> None: 2025-05-07T20:32:42.6449229Z torch.manual_seed(2025) 2025-05-07T20:32:42.6449302Z 2025-05-07T20:32:42.6449465Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6451286Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
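Note: the CompilationError above ("type fp8e4nv not supported in this architecture") is Triton rejecting FP8 E4M3 codegen on this GPU. The job runs on a g5.4xlarge, whose NVIDIA A10G reports compute capability (8, 6), while Triton's fp8e4nv lowering requires (8, 9) or newer (Ada/Hopper); only fp8e4b15 and fp8e5 are available here, exactly as the error says. A guard along these lines (a sketch; the test file may gate this differently) would skip rather than fail:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (FP8 E4M3) codegen needs compute capability
        # >= 8.9; the A10G on this runner reports (8, 6).
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage: @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 needs SM 8.9+")
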
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6451300Z 2025-05-07T20:32:42.6451413Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6451418Z 2025-05-07T20:32:42.6451521Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6451740Z self=, 2025-05-07T20:32:42.6451811Z T=128, 2025-05-07T20:32:42.6451882Z D=7168, 2025-05-07T20:32:42.6451960Z scale_ub=1200.0, 2025-05-07T20:32:42.6452039Z contiguous=True, 2025-05-07T20:32:42.6452124Z compiled=True, 2025-05-07T20:32:42.6452196Z ) 2025-05-07T20:32:42.6452406Z self = 2025-05-07T20:32:42.6452569Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6452574Z 2025-05-07T20:32:42.6452645Z @given( 2025-05-07T20:32:42.6452756Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6452856Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6452964Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6453078Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6453188Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6453258Z ) 2025-05-07T20:32:42.6453502Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6453590Z def test_silu_mul_quant( 2025-05-07T20:32:42.6453662Z self, 2025-05-07T20:32:42.6453738Z T: int, 2025-05-07T20:32:42.6453809Z D: int, 2025-05-07T20:32:42.6453902Z scale_ub: Optional[float], 2025-05-07T20:32:42.6453990Z contiguous: bool, 2025-05-07T20:32:42.6454068Z compiled: bool, 2025-05-07T20:32:42.6454141Z ) -> None: 2025-05-07T20:32:42.6454231Z torch.manual_seed(2025) 2025-05-07T20:32:42.6454297Z 2025-05-07T20:32:42.6454505Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6454578Z 2025-05-07T20:32:42.6454665Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6454788Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6454923Z x = x_sign * x_clamp 2025-05-07T20:32:42.6454998Z x0 = x[:, :D] 2025-05-07T20:32:42.6455078Z x1 = x[:, D:] 2025-05-07T20:32:42.6455146Z 2025-05-07T20:32:42.6455225Z if contiguous: 2025-05-07T20:32:42.6455317Z x0 = x0.contiguous() 2025-05-07T20:32:42.6455400Z x1 = x1.contiguous() 2025-05-07T20:32:42.6455512Z 2025-05-07T20:32:42.6455602Z if scale_ub is not None: 2025-05-07T20:32:42.6455704Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6455834Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6455909Z ) 2025-05-07T20:32:42.6455981Z else: 2025-05-07T20:32:42.6456070Z scale_ub_tensor = None 2025-05-07T20:32:42.6456142Z 2025-05-07T20:32:42.6456266Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6456352Z op = silu_mul_quant 2025-05-07T20:32:42.6456431Z if compiled: 2025-05-07T20:32:42.6456528Z op = torch.compile(op) 2025-05-07T20:32:42.6456631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6456699Z 2025-05-07T20:32:42.6456785Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6456789Z 2025-05-07T20:32:42.6456885Z moe/activation_test.py:117: 2025-05-07T20:32:42.6457048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6457153Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6457246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6457610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6457700Z return fn(*args, **kwargs) 2025-05-07T20:32:42.6458197Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6458288Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6458649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6458871Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6459206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6459299Z kernel = self.compile( 2025-05-07T20:32:42.6459675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6459848Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6459968Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6459975Z 2025-05-07T20:32:42.6460176Z self = 2025-05-07T20:32:42.6460960Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6461466Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7b390d0>} 2025-05-07T20:32:42.6462215Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6462404Z context = 2025-05-07T20:32:42.6462409Z 2025-05-07T20:32:42.6462575Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6462879Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6462980Z module_map=module_map) 2025-05-07T20:32:42.6463179Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6463273Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6463343Z E ^ 2025-05-07T20:32:42.6463697Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6463746Z 2025-05-07T20:32:42.6464160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6464165Z 2025-05-07T20:32:42.6464265Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6464481Z self=, 2025-05-07T20:32:42.6464556Z T=128, 2025-05-07T20:32:42.6464633Z D=7168, 2025-05-07T20:32:42.6464711Z scale_ub=1200.0, 2025-05-07T20:32:42.6464789Z contiguous=True, 2025-05-07T20:32:42.6464871Z compiled=False, 2025-05-07T20:32:42.6464936Z ) 2025-05-07T20:32:42.6465153Z self = 2025-05-07T20:32:42.6465316Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6465321Z 2025-05-07T20:32:42.6465391Z @given( 2025-05-07T20:32:42.6465507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6465665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6465780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6465896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6466004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6466078Z ) 2025-05-07T20:32:42.6466320Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6466424Z def test_silu_mul_quant( 2025-05-07T20:32:42.6466511Z self, 2025-05-07T20:32:42.6466589Z T: int, 2025-05-07T20:32:42.6466672Z D: int, 2025-05-07T20:32:42.6466768Z scale_ub: Optional[float], 2025-05-07T20:32:42.6466856Z contiguous: bool, 2025-05-07T20:32:42.6466938Z compiled: bool, 2025-05-07T20:32:42.6472339Z ) -> None: 2025-05-07T20:32:42.6472449Z torch.manual_seed(2025) 2025-05-07T20:32:42.6472518Z 2025-05-07T20:32:42.6472692Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6472770Z 2025-05-07T20:32:42.6472863Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6472989Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6474797Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
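Note: the test body visible in these traces builds a conditioned input and compares the fused kernel against an fp32 reference (ref_fn, shown further down in this log). Restated compactly -- the rationale comments are an interpretation, not taken from the source:

    import torch

    def conditioned_input(T: int, D: int) -> torch.Tensor:
        # Mirrors the test's preprocessing: keep the sign, bound |x| to
        # [0.01, 2.0] -- presumably so per-row quantization scales stay
        # away from degenerate values.
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
        return torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)

    def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # fp32 reference matching ref_fn in the log: SiLU(x0) * x1.
        x0, x1 = x0.float(), x1.float()
        return x0 * torch.sigmoid(x0) * x1
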
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6474805Z 2025-05-07T20:32:42.6474926Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.6474931Z 2025-05-07T20:32:42.6475031Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6475261Z self=, 2025-05-07T20:32:42.6475343Z T=128, 2025-05-07T20:32:42.6475416Z D=5120, 2025-05-07T20:32:42.6475497Z scale_ub=1200.0, 2025-05-07T20:32:42.6475575Z contiguous=True, 2025-05-07T20:32:42.6475651Z compiled=True, 2025-05-07T20:32:42.6475723Z ) 2025-05-07T20:32:42.6475940Z self = 2025-05-07T20:32:42.6476175Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6476180Z 2025-05-07T20:32:42.6476252Z @given( 2025-05-07T20:32:42.6476367Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6476500Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6476618Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6476731Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6476844Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6476949Z ) 2025-05-07T20:32:42.6477194Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6477287Z def test_silu_mul_quant( 2025-05-07T20:32:42.6477359Z self, 2025-05-07T20:32:42.6477429Z T: int, 2025-05-07T20:32:42.6477505Z D: int, 2025-05-07T20:32:42.6477597Z scale_ub: Optional[float], 2025-05-07T20:32:42.6477684Z contiguous: bool, 2025-05-07T20:32:42.6477765Z compiled: bool, 2025-05-07T20:32:42.6477842Z ) -> None: 2025-05-07T20:32:42.6477931Z torch.manual_seed(2025) 2025-05-07T20:32:42.6478005Z 2025-05-07T20:32:42.6478172Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6478241Z 2025-05-07T20:32:42.6478329Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6478447Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6480257Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
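Note: both the kernel under test and the reference path quantize row-wise to FP8, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A torch-only emulation consistent with that contract -- an illustrative approximation, not FBGEMM's triton_quantize_fp8_row, and the E4M3_MAX constant is an assumption of the standard e4m3fn range:

    import torch

    E4M3_MAX = 448.0  # largest finite torch.float8_e4m3fn value

    def quantize_fp8_row_sketch(y, scale_ub=None):
        # Per-row scale so each row fits the e4m3 range; scale_ub, when
        # given, caps the row maximum before the scale is derived.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / E4M3_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale
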
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6480268Z 2025-05-07T20:32:42.6480382Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.6480387Z 2025-05-07T20:32:42.6480486Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6480711Z self=, 2025-05-07T20:32:42.6480784Z T=128, 2025-05-07T20:32:42.6480860Z D=7168, 2025-05-07T20:32:42.6480940Z scale_ub=None, 2025-05-07T20:32:42.6481017Z contiguous=True, 2025-05-07T20:32:42.6481096Z compiled=True, 2025-05-07T20:32:42.6481167Z ) 2025-05-07T20:32:42.6481390Z self = 2025-05-07T20:32:42.6481551Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.6481556Z 2025-05-07T20:32:42.6481629Z @given( 2025-05-07T20:32:42.6481744Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6481838Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6481947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6482060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6482169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6482236Z ) 2025-05-07T20:32:42.6482485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6482572Z def test_silu_mul_quant( 2025-05-07T20:32:42.6482646Z self, 2025-05-07T20:32:42.6482717Z T: int, 2025-05-07T20:32:42.6482786Z D: int, 2025-05-07T20:32:42.6482890Z scale_ub: Optional[float], 2025-05-07T20:32:42.6482973Z contiguous: bool, 2025-05-07T20:32:42.6483053Z compiled: bool, 2025-05-07T20:32:42.6483126Z ) -> None: 2025-05-07T20:32:42.6483215Z torch.manual_seed(2025) 2025-05-07T20:32:42.6483282Z 2025-05-07T20:32:42.6483445Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6485290Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6485339Z 2025-05-07T20:32:42.6485455Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6485585Z =============================== warnings summary =============================== 2025-05-07T20:32:42.6485890Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:42.6486213Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:42.6486528Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:42.6487426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:42.6487650Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:42.6487694Z 2025-05-07T20:32:42.6487904Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:42.6488067Z ================= 1 failed, 1 deselected, 3 warnings in 19.29s ================= 2025-05-07T20:32:44.1746478Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:44.2362988Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:44.2363252Z 2025-05-07T20:32:46.2379924Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:48.3950088Z ============================= test session starts ============================== 2025-05-07T20:32:48.3950759Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:48.3951289Z cachedir: .pytest_cache 2025-05-07T20:32:48.3951861Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:48.3952567Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:48.3952982Z plugins: hypothesis-6.131.14 2025-05-07T20:32:49.9940148Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:50.2068882Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:50.2069279Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:50.2069512Z 2025-05-07T20:32:52.8653435Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.8654240Z self=, 2025-05-07T20:32:52.8654657Z T=1, 2025-05-07T20:32:52.8654869Z D=5120, 2025-05-07T20:32:52.8655075Z scale_ub=None, 2025-05-07T20:32:52.8655293Z contiguous=True, 2025-05-07T20:32:52.8655517Z compiled=True, 2025-05-07T20:32:52.8655734Z ) 2025-05-07T20:32:52.8656059Z self = 2025-05-07T20:32:52.8656545Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.8657131Z 2025-05-07T20:32:52.8657212Z @given( 2025-05-07T20:32:52.8657447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.8657763Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.8658171Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.8658535Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.8658893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.8659180Z ) 2025-05-07T20:32:52.8659537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.8660070Z def test_silu_mul_quant( 2025-05-07T20:32:52.8660311Z self, 2025-05-07T20:32:52.8660511Z T: int, 2025-05-07T20:32:52.8660712Z D: int, 2025-05-07T20:32:52.8660926Z scale_ub: Optional[float], 2025-05-07T20:32:52.8661199Z contiguous: bool, 2025-05-07T20:32:52.8661440Z compiled: bool, 2025-05-07T20:32:52.8661669Z ) -> None: 2025-05-07T20:32:52.8661884Z torch.manual_seed(2025) 2025-05-07T20:32:52.8662133Z 2025-05-07T20:32:52.8662407Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.8662747Z 2025-05-07T20:32:52.8662942Z x_sign = torch.sign(x) 2025-05-07T20:32:52.8663239Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:52.8663543Z x = x_sign * x_clamp 2025-05-07T20:32:52.8663794Z x0 = x[:, :D] 2025-05-07T20:32:52.8664014Z x1 = x[:, D:] 2025-05-07T20:32:52.8664217Z 2025-05-07T20:32:52.8664407Z if contiguous: 2025-05-07T20:32:52.8664738Z x0 = x0.contiguous() 2025-05-07T20:32:52.8664996Z x1 = x1.contiguous() 2025-05-07T20:32:52.8665239Z 2025-05-07T20:32:52.8665432Z if scale_ub is not None: 2025-05-07T20:32:52.8665703Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.8666046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.8666362Z ) 2025-05-07T20:32:52.8666560Z else: 2025-05-07T20:32:52.8666768Z scale_ub_tensor = None 2025-05-07T20:32:52.8667022Z 2025-05-07T20:32:52.8667255Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.8667566Z op = silu_mul_quant 2025-05-07T20:32:52.8667824Z if compiled: 2025-05-07T20:32:52.8668079Z op = torch.compile(op) 2025-05-07T20:32:52.8668374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.8668655Z 2025-05-07T20:32:52.8668850Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.8669141Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.8669437Z 2025-05-07T20:32:52.8669677Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.8670087Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.8670385Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.8670704Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.8671070Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.8671375Z 2025-05-07T20:32:52.8671581Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:52.8671776Z 2025-05-07T20:32:52.8671885Z moe/activation_test.py:126: 2025-05-07T20:32:52.8672182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.8672523Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.8672855Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.8673644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.8674420Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.8674968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.8675652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.8676390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.8677151Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.8677910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:52.8678707Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.8679432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.8680113Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.8680716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.8681234Z fn() 2025-05-07T20:32:52.8681736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.8682324Z self.fn.run( 
2025-05-07T20:32:52.8682792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.8683322Z kernel = self.compile( 2025-05-07T20:32:52.8683862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.8684517Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.8684959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.8685209Z 2025-05-07T20:32:52.8685424Z self = 2025-05-07T20:32:52.8686514Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.8687921Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891b3dd9d0>} 2025-05-07T20:32:52.8689317Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.8690339Z context = 2025-05-07T20:32:52.8690630Z 2025-05-07T20:32:52.8690808Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.8691326Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.8691800Z module_map=module_map) 2025-05-07T20:32:52.8692171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.8692527Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.8692797Z E ^ 2025-05-07T20:32:52.8693272Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.8693724Z 2025-05-07T20:32:52.8694147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.8694657Z 2025-05-07T20:32:52.8694762Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.8695182Z self=, 2025-05-07T20:32:52.8695594Z T=2048, 2025-05-07T20:32:52.8695785Z D=5120, 2025-05-07T20:32:52.8695984Z scale_ub=1200.0, 2025-05-07T20:32:52.8696210Z contiguous=True, 2025-05-07T20:32:52.8696428Z compiled=False, 2025-05-07T20:32:52.8696634Z ) 2025-05-07T20:32:54.3518174Z self = 2025-05-07T20:32:54.3519253Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.3519616Z 2025-05-07T20:32:54.3519705Z @given( 2025-05-07T20:32:54.3519944Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3520350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3520663Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3521000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3521324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3521611Z ) 2025-05-07T20:32:54.3522058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3522499Z def test_silu_mul_quant( 2025-05-07T20:32:54.3522748Z self, 2025-05-07T20:32:54.3522945Z T: int, 2025-05-07T20:32:54.3523137Z D: int, 2025-05-07T20:32:54.3523357Z scale_ub: Optional[float], 2025-05-07T20:32:54.3523633Z contiguous: bool, 2025-05-07T20:32:54.3523872Z compiled: bool, 2025-05-07T20:32:54.3524102Z ) -> None: 2025-05-07T20:32:54.3524322Z torch.manual_seed(2025) 2025-05-07T20:32:54.3524563Z 2025-05-07T20:32:54.3524843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3525193Z 
2025-05-07T20:32:54.3525388Z x_sign = torch.sign(x) 2025-05-07T20:32:54.3525675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.3525995Z x = x_sign * x_clamp 2025-05-07T20:32:54.3526237Z x0 = x[:, :D] 2025-05-07T20:32:54.3526447Z x1 = x[:, D:] 2025-05-07T20:32:54.3527088Z 2025-05-07T20:32:54.3527281Z if contiguous: 2025-05-07T20:32:54.3527517Z x0 = x0.contiguous() 2025-05-07T20:32:54.3527778Z x1 = x1.contiguous() 2025-05-07T20:32:54.3528025Z 2025-05-07T20:32:54.3528213Z if scale_ub is not None: 2025-05-07T20:32:54.3528491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.3528835Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.3529139Z ) 2025-05-07T20:32:54.3529335Z else: 2025-05-07T20:32:54.3529545Z scale_ub_tensor = None 2025-05-07T20:32:54.3529794Z 2025-05-07T20:32:54.3530036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.3530355Z op = silu_mul_quant 2025-05-07T20:32:54.3530605Z if compiled: 2025-05-07T20:32:54.3530853Z op = torch.compile(op) 2025-05-07T20:32:54.3531154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.3531437Z 2025-05-07T20:32:54.3531624Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.3531800Z 2025-05-07T20:32:54.3531900Z moe/activation_test.py:117: 2025-05-07T20:32:54.3532196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.3532528Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.3532814Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.3533518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.3534216Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.3534754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.3535446Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.3536110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.3536644Z kernel = self.compile( 2025-05-07T20:32:54.3537192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.3537856Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.3538256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.3538535Z 2025-05-07T20:32:54.3538744Z self = 2025-05-07T20:32:54.3539881Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.3541298Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f88f9ced5e0>}
2025-05-07T20:32:54.3542697Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:54.3543724Z context = <...>
2025-05-07T20:32:54.3544012Z 
2025-05-07T20:32:54.3544186Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.3544715Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.3545194Z                            module_map=module_map)
2025-05-07T20:32:54.3545562Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.3545927Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.3546189Z E       ^
2025-05-07T20:32:54.3546661Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.3547161Z 
2025-05-07T20:32:54.3547581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.3548101Z 
2025-05-07T20:32:54.3548205Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.3548663Z     self=<...>,
2025-05-07T20:32:54.3549079Z     T=2048,
2025-05-07T20:32:54.3549276Z     D=5120,
2025-05-07T20:32:54.3549472Z     scale_ub=1200.0,
2025-05-07T20:32:54.3549694Z     contiguous=True,
2025-05-07T20:32:54.3549994Z     compiled=True,
2025-05-07T20:32:54.3550205Z )
2025-05-07T20:32:54.3550530Z self = <...>
2025-05-07T20:32:54.3551024Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:54.3551299Z 
2025-05-07T20:32:54.3551375Z     @given(
2025-05-07T20:32:54.3551605Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.3551919Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.3552232Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.3552565Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.3552891Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.3553180Z     )
2025-05-07T20:32:54.3553535Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.3553983Z     def test_silu_mul_quant(
2025-05-07T20:32:54.3554218Z         self,
2025-05-07T20:32:54.3554416Z         T: int,
2025-05-07T20:32:54.3554617Z         D: int,
2025-05-07T20:32:54.3554830Z         scale_ub: Optional[float],
2025-05-07T20:32:54.3555113Z         contiguous: bool,
2025-05-07T20:32:54.3555355Z         compiled: bool,
2025-05-07T20:32:54.3555575Z     ) -> None:
2025-05-07T20:32:54.3555798Z         torch.manual_seed(2025)
2025-05-07T20:32:54.3556044Z 
2025-05-07T20:32:54.3556309Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.3556667Z 
2025-05-07T20:32:54.3556871Z         x_sign = torch.sign(x)
2025-05-07T20:32:54.3557158Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:54.3557470Z         x = x_sign * x_clamp
2025-05-07T20:32:54.3557716Z         x0 = x[:, :D]
2025-05-07T20:32:54.3557929Z         x1 = x[:, D:]
2025-05-07T20:32:54.3558145Z 
2025-05-07T20:32:54.3558386Z         if contiguous:
2025-05-07T20:32:54.3558638Z             x0 = x0.contiguous()
2025-05-07T20:32:54.3558930Z             x1 = x1.contiguous()
2025-05-07T20:32:54.3559173Z 
2025-05-07T20:32:54.3559368Z         if scale_ub is not None:
2025-05-07T20:32:54.3559684Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:54.3560028Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:54.3560349Z             )
2025-05-07T20:32:54.3560538Z         else:
2025-05-07T20:32:54.3560751Z             scale_ub_tensor = None
2025-05-07T20:32:54.3561005Z 
2025-05-07T20:32:54.3561273Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.3561593Z             op = silu_mul_quant
2025-05-07T20:32:54.3561856Z             if compiled:
2025-05-07T20:32:54.3562106Z                 op = torch.compile(op)
2025-05-07T20:32:54.3562408Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.3562694Z 
2025-05-07T20:32:54.3562887Z         y_fp8, y_scale = fn()
2025-05-07T20:32:54.3563177Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:54.3563474Z 
2025-05-07T20:32:54.3563704Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.3564048Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:54.3564344Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:54.3564665Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:54.3565022Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.3565331Z 
2025-05-07T20:32:54.3565580Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:54.3565780Z 
2025-05-07T20:32:54.3565879Z moe/activation_test.py:126: 
2025-05-07T20:32:54.3566177Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.3566513Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:54.3566854Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.3567640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:54.3568397Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:54.3568949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.3569628Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.3570318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:54.3571051Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:54.3571807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:54.3572548Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:54.3573281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:54.3573924Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:54.3574529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:54.3575040Z     fn()
2025-05-07T20:32:54.3575546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:54.3576136Z     self.fn.run(
2025-05-07T20:32:54.3576608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.3577141Z     kernel = self.compile(
2025-05-07T20:32:54.3577686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.3578344Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.3578839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.3579080Z 
2025-05-07T20:32:54.3579292Z self = <...>
2025-05-07T20:32:54.3580435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.3581844Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f8919e54160>}
2025-05-07T20:32:54.3583260Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:54.3584294Z context = <...>
2025-05-07T20:32:54.3584604Z 
2025-05-07T20:32:54.3584771Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.3585310Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.3585778Z                            module_map=module_map)
2025-05-07T20:32:54.3586157Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.3586525Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:54.3586799Z E       ^
2025-05-07T20:32:54.3587300Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.3587770Z 
2025-05-07T20:32:54.3588191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.3588711Z 
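Every Hypothesis example fails the same way: Triton rejects the fp8e4nv dtype (its name for float8_e4m3fn) while lowering both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row. The job runs on linux.g5.4xlarge.nvidia.gpu, whose NVIDIA A10G reports compute capability 8.6, and this Triton build only lowers fp8e4nv on SM 8.9 or newer; on SM 8.6 it offers fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of a capability gate the test could use, assuming only stock pytest and torch (the requires_fp8 marker name is illustrative, not an existing FBGEMM helper):

    import pytest
    import torch

    # Skip fp8e4nv tests on GPUs older than SM 8.9; the A10G on
    # g5.4xlarge reports (8, 6) and only supports fp8e4b15 / fp8e5
    # in this Triton build.
    requires_fp8 = pytest.mark.skipif(
        not torch.cuda.is_available()
        or torch.cuda.get_device_capability() < (8, 9),
        reason="Triton fp8e4nv (float8_e4m3fn) needs SM 8.9 or newer",
    )

Applied as @requires_fp8 on test_silu_mul_quant, each example would report a skip instead of a CompilationError.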
2025-05-07T20:32:54.3588824Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.3589298Z     self=<...>,
2025-05-07T20:32:54.3589701Z     T=16384,
2025-05-07T20:32:54.3589941Z     D=7168,
2025-05-07T20:32:54.3590130Z     scale_ub=1200.0,
2025-05-07T20:32:54.3590352Z     contiguous=False,
2025-05-07T20:32:54.3590589Z     compiled=False,
2025-05-07T20:32:54.3590801Z )
2025-05-07T20:32:55.6908667Z moe/activation_test.py:117: 
2025-05-07T20:32:55.6922630Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.6922978Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.6923242Z E       ^
2025-05-07T20:32:55.6923721Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.6924663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.6925366Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:55.6925791Z     self=<...>,
2025-05-07T20:32:55.6926193Z     T=1,
2025-05-07T20:32:55.6926375Z     D=7168,
2025-05-07T20:32:55.6926566Z     scale_ub=None,
2025-05-07T20:32:55.6926786Z     contiguous=True,
2025-05-07T20:32:55.6927046Z     compiled=True,
2025-05-07T20:32:55.6927249Z )
2025-05-07T20:32:55.6942742Z moe/activation_test.py:126: 
2025-05-07T20:32:55.6962930Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.6963291Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:55.6963557Z E       ^
2025-05-07T20:32:55.6964065Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.6964974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.6965596Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:55.6966009Z     self=<...>,
2025-05-07T20:32:55.6966410Z     T=4096,
2025-05-07T20:32:55.6966600Z     D=5120,
2025-05-07T20:32:55.6966838Z     scale_ub=None,
2025-05-07T20:32:55.6967048Z     contiguous=False,
2025-05-07T20:32:55.6967279Z     compiled=False,
2025-05-07T20:32:55.6967483Z )
2025-05-07T20:32:57.4475293Z moe/activation_test.py:117: 
2025-05-07T20:32:57.4488948Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.4489328Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.4489609Z E       ^
2025-05-07T20:32:57.4490084Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.4490960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.4491583Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.4491997Z     self=<...>,
2025-05-07T20:32:57.4492398Z     T=4096,
2025-05-07T20:32:57.4492581Z     D=7168,
2025-05-07T20:32:57.4492778Z     scale_ub=None,
2025-05-07T20:32:57.4493005Z     contiguous=False,
2025-05-07T20:32:57.4493223Z     compiled=False,
2025-05-07T20:32:57.4493432Z )
2025-05-07T20:32:57.4506449Z moe/activation_test.py:117: 
2025-05-07T20:32:57.4520061Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.4520412Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.4520731Z E       ^
2025-05-07T20:32:57.4521201Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.4522070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.4522696Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.4523102Z     self=<...>,
2025-05-07T20:32:57.4523504Z     T=128,
2025-05-07T20:32:57.4523693Z     D=7168,
2025-05-07T20:32:57.4523880Z     scale_ub=None,
2025-05-07T20:32:57.4524098Z     contiguous=False,
2025-05-07T20:32:57.4524324Z     compiled=True,
2025-05-07T20:32:57.4524521Z )
2025-05-07T20:32:57.5305104Z moe/activation_test.py:126: 
2025-05-07T20:32:57.5325514Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.5325861Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:57.5326123Z E       ^
2025-05-07T20:32:57.5326591Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.5327483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
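The failure is independent of FBGEMM's kernels: any Triton kernel that converts a value to tl.float8e4nv trips the same ValueError during ast_to_ttir on this GPU. A self-contained repro sketch, assuming a CUDA device and a Triton build matching this log (the kernel and variable names are hypothetical):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below is what the compiler rejects on pre-SM-8.9 parts.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # Raises CompilationError on SM 8.6 (A10G); compiles on SM 8.9+.
    _cast_fp8e4nv[(1,)](x, y, 1024, BLOCK=1024)

On an SM 8.9+ device (L4, H100, and similar) the same kernel compiles and runs, which is why these tests pass on other runners.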
2025-05-07T20:32:57.5328098Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.5328511Z     self=<...>,
2025-05-07T20:32:57.5328918Z     T=128,
2025-05-07T20:32:57.5329098Z     D=7168,
2025-05-07T20:32:57.5329309Z     scale_ub=None,
2025-05-07T20:32:57.5329545Z     contiguous=False,
2025-05-07T20:32:57.5329767Z     compiled=False,
2025-05-07T20:32:57.5329972Z )
2025-05-07T20:32:57.9364380Z moe/activation_test.py:117: 
2025-05-07T20:32:57.9377886Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.9378242Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.9378506Z E       ^
2025-05-07T20:32:57.9378978Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.9379852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.9380464Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.9380885Z     self=<...>,
2025-05-07T20:32:57.9381300Z     T=4096,
2025-05-07T20:32:57.9381486Z     D=5120,
2025-05-07T20:32:57.9381679Z     scale_ub=1200.0,
2025-05-07T20:32:57.9381907Z     contiguous=True,
2025-05-07T20:32:57.9382130Z     compiled=False,
2025-05-07T20:32:57.9382336Z )
2025-05-07T20:32:57.9395335Z moe/activation_test.py:117: 
2025-05-07T20:32:57.9409121Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.9409473Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.9409731Z E       ^
2025-05-07T20:32:57.9410193Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.9411141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.9411760Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.9412178Z     self=<...>,
2025-05-07T20:32:57.9412582Z     T=1,
2025-05-07T20:32:57.9412770Z     D=5120,
2025-05-07T20:32:57.9412960Z     scale_ub=None,
2025-05-07T20:32:57.9413171Z     contiguous=True,
2025-05-07T20:32:57.9413398Z     compiled=True,
2025-05-07T20:32:57.9413601Z )
2025-05-07T20:32:58.6038650Z moe/activation_test.py:126: 
2025-05-07T20:32:58.6058589Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.6058937Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.6059203Z E       ^
2025-05-07T20:32:58.6059759Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.6060633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
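Plain PyTorch can cast to torch.float8_e4m3fn on this GPU (the eager cast is not gated on SM 8.9), so the rowwise quantization that ref_fn delegates to triton_quantize_fp8_row can be approximated without Triton. A sketch under the dequant convention the test itself uses (y is recovered as y_fp8.float() * y_scale[:, None]); the eps and saturation details here are assumptions, not FBGEMM's exact semantics:

    from typing import Optional

    import torch

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ):
        # Per-row max |y|, optionally clamped from above by scale_ub.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Dequant scale: y ~= y_fp8.float() * y_scale[:, None].
        y_scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (
            (y.float() / y_scale[:, None])
            .clamp(-fp8_max, fp8_max)
            .to(torch.float8_e4m3fn)
        )
        return y_fp8, y_scale

Such a fallback would let the reference path run on the A10G even though the fused Triton kernels cannot compile there.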
2025-05-07T20:32:59.2263832Z op = torch.compile(op) 2025-05-07T20:32:59.2264140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.2264410Z 2025-05-07T20:32:59.2264601Z y_fp8, y_scale = fn() 2025-05-07T20:32:59.2264888Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:59.2265174Z 2025-05-07T20:32:59.2265413Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.2265758Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:59.2266045Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:59.2266360Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:59.2266725Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.2267102Z 2025-05-07T20:32:59.2267308Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:59.2267502Z 2025-05-07T20:32:59.2267606Z moe/activation_test.py:126: 2025-05-07T20:32:59.2267903Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.2268232Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:59.2268570Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.2269363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:59.2270199Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:59.2270750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.2271431Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.2272126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:59.2272842Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.2273595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:59.2274339Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.2275062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:59.2275696Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:59.2276303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:59.2276816Z fn() 2025-05-07T20:32:59.2277317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:59.2277904Z self.fn.run( 2025-05-07T20:32:59.2278368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.2278896Z kernel = self.compile( 2025-05-07T20:32:59.2279437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.2280144Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.2280544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.2280768Z 2025-05-07T20:32:59.2281007Z self = 2025-05-07T20:32:59.2282099Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:59.2283541Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8918da8f70>} 2025-05-07T20:32:59.2284885Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.2285907Z context = 2025-05-07T20:32:59.2286192Z 2025-05-07T20:32:59.2286360Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.2286887Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.2287355Z module_map=module_map) 2025-05-07T20:32:59.2287727Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.2288075Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:59.2288384Z E ^ 2025-05-07T20:32:59.2288856Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.2289307Z 2025-05-07T20:32:59.2289726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.2290240Z 2025-05-07T20:32:59.2290345Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.2290761Z self=, 2025-05-07T20:32:59.2291167Z T=128, 2025-05-07T20:32:59.2291347Z D=5120, 2025-05-07T20:32:59.2291542Z scale_ub=None, 2025-05-07T20:32:59.2291763Z contiguous=True, 2025-05-07T20:32:59.2291983Z compiled=True, 2025-05-07T20:32:59.2292188Z ) 2025-05-07T20:33:00.2026095Z self = 2025-05-07T20:33:00.2026799Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.2027198Z 2025-05-07T20:33:00.2027325Z @given( 2025-05-07T20:33:00.2027630Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2028050Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2028460Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2028789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2029122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2029411Z ) 2025-05-07T20:33:00.2029782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2030278Z def test_silu_mul_quant( 2025-05-07T20:33:00.2030520Z self, 2025-05-07T20:33:00.2030716Z T: int, 2025-05-07T20:33:00.2030913Z D: int, 2025-05-07T20:33:00.2031124Z scale_ub: Optional[float], 2025-05-07T20:33:00.2031403Z contiguous: bool, 2025-05-07T20:33:00.2031641Z compiled: bool, 2025-05-07T20:33:00.2031862Z ) -> None: 2025-05-07T20:33:00.2032087Z torch.manual_seed(2025) 2025-05-07T20:33:00.2032330Z 2025-05-07T20:33:00.2032599Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2032945Z 2025-05-07T20:33:00.2033142Z x_sign = torch.sign(x) 2025-05-07T20:33:00.2033427Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.2033872Z x = x_sign * x_clamp 2025-05-07T20:33:00.2034114Z x0 = x[:, :D] 2025-05-07T20:33:00.2034323Z x1 = x[:, D:] 2025-05-07T20:33:00.2034531Z 2025-05-07T20:33:00.2034716Z if contiguous: 2025-05-07T20:33:00.2034940Z x0 = x0.contiguous() 2025-05-07T20:33:00.2035266Z x1 = x1.contiguous() 2025-05-07T20:33:00.2035511Z 2025-05-07T20:33:00.2035704Z if scale_ub is not None: 2025-05-07T20:33:00.2035972Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.2036310Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.2036689Z ) 2025-05-07T20:33:00.2036880Z else: 2025-05-07T20:33:00.2037094Z scale_ub_tensor = None 2025-05-07T20:33:00.2037345Z 2025-05-07T20:33:00.2037570Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:33:00.2037886Z op = silu_mul_quant 2025-05-07T20:33:00.2038134Z if compiled: 2025-05-07T20:33:00.2038375Z op = torch.compile(op) 2025-05-07T20:33:00.2038677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2038950Z 2025-05-07T20:33:00.2039135Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.2039424Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.2039733Z 2025-05-07T20:33:00.2040004Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.2040334Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.2040625Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.2040941Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.2041364Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.2041679Z 2025-05-07T20:33:00.2041878Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:00.2042076Z 2025-05-07T20:33:00.2042176Z moe/activation_test.py:126: 2025-05-07T20:33:00.2042471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2042807Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.2043135Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.2043921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.2044686Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.2045232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.2045915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.2046603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.2047319Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.2048067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:00.2048806Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.2049535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.2050229Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.2050827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.2051336Z fn() 2025-05-07T20:33:00.2051840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.2052423Z self.fn.run( 2025-05-07T20:33:00.2052887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.2053418Z kernel = self.compile( 2025-05-07T20:33:00.2053955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.2054655Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.2055046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2055314Z 2025-05-07T20:33:00.2055520Z self = 2025-05-07T20:33:00.2056613Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:00.2058060Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8919050a60>}
2025-05-07T20:33:00.2059401Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:00.2060484Z context =
2025-05-07T20:33:00.2060776Z
2025-05-07T20:33:00.2060946Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:00.2061475Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:33:00.2062309Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.2062712Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.2062973Z E       ^
2025-05-07T20:33:00.2063438Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.2063900Z
2025-05-07T20:33:00.2064318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.2064837Z
2025-05-07T20:33:00.2064947Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[Test source and traceback identical to the T=128 example above, elided: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row raises the same CompilationError at compiler.py:100.]
2025-05-07T20:33:01.0448283Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:01.0881839Z W0507 20:33:01.086616 88371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:01.0883085Z W0507 20:33:01.086616 88371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:01.0884424Z W0507 20:33:01.086616 88371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:01.0885424Z W0507 20:33:01.086616 88371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:01.0886535Z W0507 20:33:01.086616 88371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
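The recompile-limit warning above comes from the Hypothesis parameter sweep rather than from the FP8 failure itself: each (T, D, contiguous) combination changes the strides of x0/x1 (a non-contiguous slice of x keeps row stride 2*D = 10240, while a .contiguous() copy has row stride D = 5120), torch.compile guards on strides, and the eighth distinct guard set trips config.recompile_limit (8), after which Dynamo falls back to eager. A minimal sketch of how a sweep like this could sidestep the limit (assuming the torch._dynamo knobs present in this PyTorch build; none of this is in the test file, and the import path is inferred from the traceback):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Option 1: raise the limit so every stride pattern in the sweep fits.
    torch._dynamo.config.recompile_limit = 64

    # Option 2: compile once with dynamic shapes so one graph covers all sizes.
    compiled_op = torch.compile(silu_mul_quant, dynamic=True)

    # Option 3: reset Dynamo's caches between examples so each run starts fresh.
    torch._dynamo.reset()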
2025-05-07T20:33:01.2090462Z self = 2025-05-07T20:33:01.2091012Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.2091289Z 2025-05-07T20:33:01.2091366Z @given( 2025-05-07T20:33:01.2091602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2091918Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2092320Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2092657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2092986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2093337Z ) 2025-05-07T20:33:01.2093687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2094129Z def test_silu_mul_quant( 2025-05-07T20:33:01.2094372Z self, 2025-05-07T20:33:01.2094563Z T: int, 2025-05-07T20:33:01.2094762Z D: int, 2025-05-07T20:33:01.2095044Z scale_ub: Optional[float], 2025-05-07T20:33:01.2095311Z contiguous: bool, 2025-05-07T20:33:01.2095549Z compiled: bool, 2025-05-07T20:33:01.2095773Z ) -> None: 2025-05-07T20:33:01.2095981Z torch.manual_seed(2025) 2025-05-07T20:33:01.2096222Z 2025-05-07T20:33:01.2096495Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2096834Z 2025-05-07T20:33:01.2097025Z x_sign = torch.sign(x) 2025-05-07T20:33:01.2097320Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.2097629Z x = x_sign * x_clamp 2025-05-07T20:33:01.2097861Z x0 = x[:, :D] 2025-05-07T20:33:01.2098087Z x1 = x[:, D:] 2025-05-07T20:33:01.2098294Z 2025-05-07T20:33:01.2098470Z if contiguous: 2025-05-07T20:33:01.2098704Z x0 = x0.contiguous() 2025-05-07T20:33:01.2098961Z x1 = x1.contiguous() 2025-05-07T20:33:01.2099195Z 2025-05-07T20:33:01.2099386Z if scale_ub is not None: 2025-05-07T20:33:01.2099729Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.2100061Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.2100399Z ) 2025-05-07T20:33:01.2100615Z else: 2025-05-07T20:33:01.2100822Z scale_ub_tensor = None 2025-05-07T20:33:01.2101079Z 2025-05-07T20:33:01.2101316Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.2101627Z op = silu_mul_quant 2025-05-07T20:33:01.2101877Z if compiled: 2025-05-07T20:33:01.2102135Z op = torch.compile(op) 2025-05-07T20:33:01.2102430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2102712Z 2025-05-07T20:33:01.2102909Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.2103195Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.2103482Z 2025-05-07T20:33:01.2103885Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.2104233Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.2104522Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.2104840Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.2105207Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.2105515Z 2025-05-07T20:33:01.2105720Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:01.2105916Z 2025-05-07T20:33:01.2106030Z moe/activation_test.py:126: 2025-05-07T20:33:01.2106336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2106669Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.2107007Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.2107814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.2108576Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.2109128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.2109877Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.2110565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.2111356Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.2112108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:01.2112913Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.2113652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.2114286Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.2114971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.2115493Z fn() 2025-05-07T20:33:01.2115996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.2116575Z self.fn.run( 2025-05-07T20:33:01.2117051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.2117581Z kernel = self.compile( 2025-05-07T20:33:01.2118119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.2118779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.2119177Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2119411Z 2025-05-07T20:33:01.2119615Z self = 2025-05-07T20:33:01.2120833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.2122219Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8918befc10>} 2025-05-07T20:33:01.2123603Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.2124634Z context = 2025-05-07T20:33:01.2124919Z 2025-05-07T20:33:01.2125086Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.2125620Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.2126096Z module_map=module_map) 2025-05-07T20:33:01.2126462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.2126820Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.2127086Z E ^ 2025-05-07T20:33:01.2127548Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.2128003Z 2025-05-07T20:33:01.2128425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.2135808Z 2025-05-07T20:33:01.2135951Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2136385Z self=, 2025-05-07T20:33:01.2136804Z T=1, 2025-05-07T20:33:01.2136998Z D=5120, 2025-05-07T20:33:01.2137193Z scale_ub=1200.0, 2025-05-07T20:33:01.2137437Z contiguous=True, 2025-05-07T20:33:01.2137668Z compiled=True, 2025-05-07T20:33:01.2137877Z ) 2025-05-07T20:33:01.3837391Z self = 2025-05-07T20:33:01.3837898Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.3838198Z 2025-05-07T20:33:01.3838277Z @given( 2025-05-07T20:33:01.3838647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.3838956Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.3839269Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.3839603Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.3839994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.3840336Z ) 2025-05-07T20:33:01.3840694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.3841143Z def test_silu_mul_quant( 2025-05-07T20:33:01.3841379Z self, 2025-05-07T20:33:01.3841644Z T: int, 2025-05-07T20:33:01.3841847Z D: int, 2025-05-07T20:33:01.3842066Z scale_ub: Optional[float], 2025-05-07T20:33:01.3842345Z contiguous: bool, 2025-05-07T20:33:01.3842585Z compiled: bool, 2025-05-07T20:33:01.3842802Z ) -> None: 2025-05-07T20:33:01.3843018Z torch.manual_seed(2025) 2025-05-07T20:33:01.3843270Z 2025-05-07T20:33:01.3843543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.3843892Z 2025-05-07T20:33:01.3844085Z x_sign = torch.sign(x) 2025-05-07T20:33:01.3844372Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.3844689Z x = x_sign * x_clamp 2025-05-07T20:33:01.3844937Z x0 = x[:, :D] 2025-05-07T20:33:01.3845145Z x1 = x[:, D:] 2025-05-07T20:33:01.3845354Z 2025-05-07T20:33:01.3845545Z if contiguous: 2025-05-07T20:33:01.3845778Z x0 = x0.contiguous() 2025-05-07T20:33:01.3846104Z x1 = x1.contiguous() 2025-05-07T20:33:01.3846352Z 2025-05-07T20:33:01.3846546Z if scale_ub is not None: 2025-05-07T20:33:01.3846816Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.3847152Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.3847457Z ) 2025-05-07T20:33:01.3847649Z else: 2025-05-07T20:33:01.3847869Z scale_ub_tensor = None 2025-05-07T20:33:01.3848124Z 2025-05-07T20:33:01.3848354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.3848673Z op = silu_mul_quant 2025-05-07T20:33:01.3848923Z if compiled: 2025-05-07T20:33:01.3849174Z op = torch.compile(op) 2025-05-07T20:33:01.3849475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.3849752Z 2025-05-07T20:33:01.3849939Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.3850114Z 2025-05-07T20:33:01.3850213Z moe/activation_test.py:117: 2025-05-07T20:33:01.3850567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.3850902Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.3851182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.3851747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.3852312Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.3852978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.3853660Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.3854197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.3854888Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.3855553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.3856082Z kernel = self.compile( 2025-05-07T20:33:01.3856625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.3857278Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.3857668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.3857950Z 2025-05-07T20:33:01.3858155Z self = 2025-05-07T20:33:01.3859281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.3860668Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8918457670>} 2025-05-07T20:33:01.3862060Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.3863081Z context = 2025-05-07T20:33:01.3863374Z 2025-05-07T20:33:01.3863540Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.3864067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.3864535Z module_map=module_map) 2025-05-07T20:33:01.3864899Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.3865254Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.3865510Z E ^ 2025-05-07T20:33:01.3865975Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.3866481Z 2025-05-07T20:33:01.3866898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.3867412Z 2025-05-07T20:33:01.3867515Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.3867931Z self=, 2025-05-07T20:33:01.3868341Z T=1, 2025-05-07T20:33:01.3868523Z D=5120, 2025-05-07T20:33:01.3868713Z scale_ub=None, 2025-05-07T20:33:01.3868924Z contiguous=False, 2025-05-07T20:33:01.3869157Z compiled=True, 2025-05-07T20:33:01.3869363Z ) 2025-05-07T20:33:01.4675961Z self = 2025-05-07T20:33:01.4676501Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.4676768Z 2025-05-07T20:33:01.4676859Z @given( 2025-05-07T20:33:01.4677092Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.4677423Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.4677739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.4678072Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.4678411Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.4678706Z ) 2025-05-07T20:33:01.4679068Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.4679512Z def test_silu_mul_quant( 2025-05-07T20:33:01.4679764Z self, 2025-05-07T20:33:01.4679968Z T: int, 2025-05-07T20:33:01.4680187Z D: int, 2025-05-07T20:33:01.4680436Z scale_ub: Optional[float], 2025-05-07T20:33:01.4680720Z contiguous: bool, 2025-05-07T20:33:01.4680958Z compiled: bool, 2025-05-07T20:33:01.4681187Z ) -> None: 2025-05-07T20:33:01.4681410Z torch.manual_seed(2025) 2025-05-07T20:33:01.4681648Z 2025-05-07T20:33:01.4681925Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.4682279Z 2025-05-07T20:33:01.4682472Z x_sign = torch.sign(x) 2025-05-07T20:33:01.4682768Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.4683083Z x = x_sign * x_clamp 2025-05-07T20:33:01.4683322Z x0 = x[:, :D] 2025-05-07T20:33:01.4683546Z x1 = x[:, D:] 2025-05-07T20:33:01.4683764Z 2025-05-07T20:33:01.4684047Z if contiguous: 2025-05-07T20:33:01.4684285Z x0 = x0.contiguous() 2025-05-07T20:33:01.4684549Z x1 = x1.contiguous() 2025-05-07T20:33:01.4684795Z 2025-05-07T20:33:01.4684987Z if scale_ub is not None: 2025-05-07T20:33:01.4685334Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.4685687Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.4685994Z ) 2025-05-07T20:33:01.4686196Z else: 2025-05-07T20:33:01.4686416Z scale_ub_tensor = None 2025-05-07T20:33:01.4686670Z 2025-05-07T20:33:01.4686973Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.4687299Z op = silu_mul_quant 2025-05-07T20:33:01.4687555Z if compiled: 2025-05-07T20:33:01.4687812Z op = torch.compile(op) 2025-05-07T20:33:01.4688124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.4688403Z 2025-05-07T20:33:01.4688607Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.4688911Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.4689216Z 2025-05-07T20:33:01.4689454Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.4689802Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.4690108Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.4690465Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.4690850Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.4691174Z 2025-05-07T20:33:01.4691450Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:01.4691659Z 2025-05-07T20:33:01.4691765Z moe/activation_test.py:126: 2025-05-07T20:33:01.4692070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.4692414Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.4692745Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.4693545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.4694330Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.4694886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.4695586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.4696288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.4697021Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.4697771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:01.4698527Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.4699266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.4699911Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.4700518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.4701040Z fn() 2025-05-07T20:33:01.4701551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.4702135Z self.fn.run( 2025-05-07T20:33:01.4702617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.4703155Z kernel = self.compile( 2025-05-07T20:33:01.4703852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.4704509Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.4704982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.4705212Z 2025-05-07T20:33:01.4705424Z self = 2025-05-07T20:33:01.4706571Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.4707973Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f89184b79d0>} 2025-05-07T20:33:01.4709411Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.4710550Z context = 2025-05-07T20:33:01.4710840Z 2025-05-07T20:33:01.4711013Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.4711531Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.4712014Z module_map=module_map) 2025-05-07T20:33:01.4712382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.4712737Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.4713010Z E ^ 2025-05-07T20:33:01.4713552Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.4714028Z 2025-05-07T20:33:01.4714445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.4714972Z 2025-05-07T20:33:01.4715078Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.4715504Z self=, 2025-05-07T20:33:01.4715919Z T=1, 2025-05-07T20:33:01.4716109Z D=5120, 2025-05-07T20:33:01.4716313Z scale_ub=None, 2025-05-07T20:33:01.4716526Z contiguous=True, 2025-05-07T20:33:01.4716769Z compiled=False, 2025-05-07T20:33:01.4716983Z ) 2025-05-07T20:33:01.8254142Z self = 2025-05-07T20:33:01.8255437Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.8255965Z 2025-05-07T20:33:01.8256124Z @given( 2025-05-07T20:33:01.8256601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.8257225Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.8257833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.8258480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.8259135Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.8259700Z ) 2025-05-07T20:33:01.8260387Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.8260888Z def test_silu_mul_quant( 2025-05-07T20:33:01.8261130Z self, 2025-05-07T20:33:01.8261327Z T: int, 2025-05-07T20:33:01.8261528Z D: int, 2025-05-07T20:33:01.8261743Z scale_ub: Optional[float], 2025-05-07T20:33:01.8262009Z contiguous: bool, 2025-05-07T20:33:01.8262246Z compiled: bool, 2025-05-07T20:33:01.8262474Z ) -> None: 2025-05-07T20:33:01.8262687Z torch.manual_seed(2025) 2025-05-07T20:33:01.8262933Z 2025-05-07T20:33:01.8263211Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.8263563Z 2025-05-07T20:33:01.8263754Z x_sign = torch.sign(x) 2025-05-07T20:33:01.8264048Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.8264359Z x = x_sign * x_clamp 2025-05-07T20:33:01.8264590Z x0 = x[:, :D] 2025-05-07T20:33:01.8264923Z x1 = x[:, D:] 2025-05-07T20:33:01.8265130Z 2025-05-07T20:33:01.8265310Z if contiguous: 2025-05-07T20:33:01.8265542Z x0 = x0.contiguous() 2025-05-07T20:33:01.8265804Z x1 = x1.contiguous() 2025-05-07T20:33:01.8266047Z 2025-05-07T20:33:01.8266306Z if scale_ub is not None: 2025-05-07T20:33:01.8266586Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.8266923Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.8267236Z ) 2025-05-07T20:33:01.8267437Z else: 2025-05-07T20:33:01.8267712Z scale_ub_tensor = None 2025-05-07T20:33:01.8267970Z 2025-05-07T20:33:01.8268212Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.8268529Z op = silu_mul_quant 2025-05-07T20:33:01.8268777Z if compiled: 2025-05-07T20:33:01.8269028Z op 
= torch.compile(op) 2025-05-07T20:33:01.8269327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8269601Z 2025-05-07T20:33:01.8269879Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.8270062Z 2025-05-07T20:33:01.8270163Z moe/activation_test.py:117: 2025-05-07T20:33:01.8270467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8270799Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.8271089Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8271781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.8272553Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.8273099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.8273783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.8274443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.8274975Z kernel = self.compile( 2025-05-07T20:33:01.8275521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.8276175Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.8276565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8276796Z 2025-05-07T20:33:01.8277002Z self = 2025-05-07T20:33:01.8278096Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.8279484Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8918480940>} 2025-05-07T20:33:01.8280879Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.8281908Z context = 2025-05-07T20:33:01.8282198Z 2025-05-07T20:33:01.8282370Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.8282894Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.8283364Z module_map=module_map) 2025-05-07T20:33:01.8283730Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.8284082Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.8284334Z E ^ 2025-05-07T20:33:01.8284808Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.8285314Z 2025-05-07T20:33:01.8285730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.8286243Z 2025-05-07T20:33:01.8286389Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.8286804Z self=, 2025-05-07T20:33:01.8287213Z T=128, 2025-05-07T20:33:01.8287400Z D=5120, 2025-05-07T20:33:01.8287586Z scale_ub=None, 2025-05-07T20:33:01.8287803Z contiguous=False, 2025-05-07T20:33:01.8288069Z compiled=True, 2025-05-07T20:33:01.8288275Z ) 2025-05-07T20:33:01.8288592Z self = 2025-05-07T20:33:01.8289084Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.8289352Z 2025-05-07T20:33:01.8289435Z @given( 2025-05-07T20:33:01.8289658Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.8289980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.8290294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.8290646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.8291009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.8291295Z ) 2025-05-07T20:33:01.8291640Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.8292073Z def test_silu_mul_quant( 2025-05-07T20:33:01.8292316Z self, 2025-05-07T20:33:01.8292514Z T: int, 2025-05-07T20:33:01.8292753Z D: int, 2025-05-07T20:33:01.8292976Z scale_ub: Optional[float], 2025-05-07T20:33:01.8293251Z contiguous: bool, 2025-05-07T20:33:01.8293482Z compiled: bool, 2025-05-07T20:33:01.8293702Z ) -> None: 2025-05-07T20:33:01.8293917Z torch.manual_seed(2025) 2025-05-07T20:33:01.8294156Z 2025-05-07T20:33:01.8294426Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.8294771Z 2025-05-07T20:33:01.8294960Z x_sign = torch.sign(x) 2025-05-07T20:33:01.8295251Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.8295566Z x = x_sign * x_clamp 2025-05-07T20:33:01.8295802Z x0 = x[:, :D] 2025-05-07T20:33:01.8296020Z x1 = x[:, D:] 2025-05-07T20:33:01.8296226Z 2025-05-07T20:33:01.8296402Z if contiguous: 2025-05-07T20:33:01.8296631Z x0 = x0.contiguous() 2025-05-07T20:33:01.8296889Z x1 = x1.contiguous() 2025-05-07T20:33:01.8297136Z 2025-05-07T20:33:01.8297328Z if scale_ub is not None: 2025-05-07T20:33:01.8297598Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.8297936Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.8298238Z ) 2025-05-07T20:33:01.8298431Z else: 2025-05-07T20:33:01.8298637Z scale_ub_tensor = None 2025-05-07T20:33:01.8298886Z 2025-05-07T20:33:01.8299119Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.8299433Z op = silu_mul_quant 2025-05-07T20:33:01.8299677Z if compiled: 2025-05-07T20:33:01.8299924Z op = torch.compile(op) 2025-05-07T20:33:01.8300251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8300544Z 2025-05-07T20:33:01.8300735Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.8300897Z 2025-05-07T20:33:01.8301000Z moe/activation_test.py:117: 2025-05-07T20:33:01.8301294Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8301628Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.8301918Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8302475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.8303029Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.8303686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.8304705Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.8305329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.8306009Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.8306672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.8307203Z kernel = self.compile( 2025-05-07T20:33:01.8307805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.8308460Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.8308860Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8309087Z 2025-05-07T20:33:01.8309315Z self = 2025-05-07T20:33:01.8310519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.8312084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917cbd040>} 2025-05-07T20:33:01.8313651Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.8314676Z context = 2025-05-07T20:33:01.8314958Z 2025-05-07T20:33:01.8315129Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.8315652Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.8316122Z module_map=module_map) 2025-05-07T20:33:01.8316483Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.8316827Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.8317078Z E ^ 2025-05-07T20:33:01.8317545Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.8317992Z 2025-05-07T20:33:01.8318417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.8318931Z 2025-05-07T20:33:01.8319033Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.8319443Z self=, 2025-05-07T20:33:01.8319849Z T=128, 2025-05-07T20:33:01.8320031Z D=7168, 2025-05-07T20:33:01.8320216Z scale_ub=1200.0, 2025-05-07T20:33:01.8320443Z contiguous=False, 2025-05-07T20:33:01.8320699Z compiled=False, 2025-05-07T20:33:01.8320908Z ) 2025-05-07T20:33:01.9848381Z self = 2025-05-07T20:33:01.9848953Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.9849231Z 2025-05-07T20:33:01.9849310Z @given( 2025-05-07T20:33:01.9849614Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.9850026Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.9850340Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.9850674Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.9856969Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.9857297Z ) 2025-05-07T20:33:01.9857651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.9858215Z def test_silu_mul_quant( 2025-05-07T20:33:01.9858460Z self, 2025-05-07T20:33:01.9858653Z T: int, 2025-05-07T20:33:01.9858845Z D: int, 2025-05-07T20:33:01.9859066Z scale_ub: Optional[float], 2025-05-07T20:33:01.9859336Z contiguous: bool, 2025-05-07T20:33:01.9859643Z compiled: bool, 2025-05-07T20:33:01.9859869Z ) -> None: 2025-05-07T20:33:01.9860078Z torch.manual_seed(2025) 2025-05-07T20:33:01.9860329Z 2025-05-07T20:33:01.9860611Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.9860957Z 2025-05-07T20:33:01.9861219Z x_sign = torch.sign(x) 2025-05-07T20:33:01.9861512Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.9861824Z x = x_sign * x_clamp 2025-05-07T20:33:01.9862065Z x0 = x[:, :D] 2025-05-07T20:33:01.9862278Z x1 = x[:, D:] 2025-05-07T20:33:01.9862484Z 2025-05-07T20:33:01.9862664Z if contiguous: 2025-05-07T20:33:01.9862903Z x0 = x0.contiguous() 2025-05-07T20:33:01.9863162Z x1 = x1.contiguous() 2025-05-07T20:33:01.9863398Z 2025-05-07T20:33:01.9863592Z if scale_ub is not None: 2025-05-07T20:33:01.9863874Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.9864209Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.9864520Z ) 2025-05-07T20:33:01.9864710Z else: 2025-05-07T20:33:01.9864913Z scale_ub_tensor = None 2025-05-07T20:33:01.9865164Z 2025-05-07T20:33:01.9865396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.9865791Z op = silu_mul_quant 2025-05-07T20:33:01.9866037Z if compiled: 2025-05-07T20:33:01.9866290Z op = torch.compile(op) 2025-05-07T20:33:01.9866591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9866862Z 2025-05-07T20:33:01.9867055Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.9867221Z 2025-05-07T20:33:01.9867327Z moe/activation_test.py:117: 2025-05-07T20:33:01.9867616Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9867951Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.9868228Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9868917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.9869619Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.9870236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.9870924Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.9871589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.9872121Z kernel = self.compile( 2025-05-07T20:33:01.9872668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.9873323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.9873716Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9873947Z 2025-05-07T20:33:01.9874151Z self = 2025-05-07T20:33:01.9875236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.9876619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917cbdd30>} 2025-05-07T20:33:01.9877991Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.9879070Z context = 2025-05-07T20:33:01.9879354Z 2025-05-07T20:33:01.9879556Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.9880080Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.9880595Z module_map=module_map) 2025-05-07T20:33:01.9880962Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.9881351Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.9881616Z E ^ 2025-05-07T20:33:01.9882088Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.9882540Z 2025-05-07T20:33:01.9882959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.9883479Z 2025-05-07T20:33:01.9883580Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.9883994Z self=, 2025-05-07T20:33:01.9884410Z T=128, 2025-05-07T20:33:01.9884594Z D=5120, 2025-05-07T20:33:01.9884782Z scale_ub=None, 2025-05-07T20:33:01.9884997Z contiguous=False, 2025-05-07T20:33:01.9885216Z compiled=False, 2025-05-07T20:33:01.9885423Z ) 2025-05-07T20:33:01.9885745Z self = 2025-05-07T20:33:01.9886280Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.9886555Z 2025-05-07T20:33:01.9886634Z @given( 2025-05-07T20:33:01.9886863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.9887171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.9887476Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.9887812Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.9888142Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.9888419Z ) 2025-05-07T20:33:01.9888766Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.9889208Z def test_silu_mul_quant( 2025-05-07T20:33:01.9889441Z self, 2025-05-07T20:33:01.9889637Z T: int, 2025-05-07T20:33:01.9889830Z D: int, 2025-05-07T20:33:01.9890040Z scale_ub: Optional[float], 2025-05-07T20:33:01.9890315Z contiguous: bool, 2025-05-07T20:33:01.9890598Z compiled: bool, 2025-05-07T20:33:01.9890830Z ) -> None: 2025-05-07T20:33:01.9891041Z torch.manual_seed(2025) 2025-05-07T20:33:01.9891279Z 2025-05-07T20:33:01.9891541Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.9891878Z 2025-05-07T20:33:01.9892067Z x_sign = torch.sign(x) 2025-05-07T20:33:01.9892356Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.9892655Z x = x_sign * x_clamp 2025-05-07T20:33:01.9892893Z x0 = x[:, :D] 2025-05-07T20:33:01.9893111Z x1 = x[:, D:] 2025-05-07T20:33:01.9893311Z 2025-05-07T20:33:01.9893497Z if contiguous: 2025-05-07T20:33:01.9893727Z x0 = x0.contiguous() 2025-05-07T20:33:01.9893977Z x1 = x1.contiguous() 2025-05-07T20:33:01.9894215Z 2025-05-07T20:33:01.9894403Z if scale_ub is not None: 2025-05-07T20:33:01.9894665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.9895007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.9895313Z ) 2025-05-07T20:33:01.9895497Z else: 2025-05-07T20:33:01.9895706Z scale_ub_tensor = None 2025-05-07T20:33:01.9895958Z 2025-05-07T20:33:01.9896181Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.9896491Z op = silu_mul_quant 2025-05-07T20:33:01.9896792Z if compiled: 2025-05-07T20:33:01.9897040Z op = torch.compile(op) 2025-05-07T20:33:01.9897329Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9897602Z 2025-05-07T20:33:01.9897829Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.9897999Z 2025-05-07T20:33:01.9898098Z moe/activation_test.py:117: 2025-05-07T20:33:01.9898394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9898720Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.9898993Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9899725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.9900419Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.9900957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.9901638Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.9902294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.9902825Z kernel = self.compile( 2025-05-07T20:33:01.9903361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.9904607Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.9905002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9905307Z 2025-05-07T20:33:01.9905515Z self = 2025-05-07T20:33:01.9906597Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.9907982Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c82310>} 2025-05-07T20:33:01.9909338Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.9910456Z context = 2025-05-07T20:33:01.9910742Z 2025-05-07T20:33:01.9910916Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.9911433Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.9911896Z module_map=module_map) 2025-05-07T20:33:01.9912266Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.9912616Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.9912873Z E ^ 2025-05-07T20:33:01.9913340Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.9913792Z 2025-05-07T20:33:01.9914216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.9914728Z 2025-05-07T20:33:01.9914832Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.9915247Z self=, 2025-05-07T20:33:01.9915662Z T=128, 2025-05-07T20:33:01.9915841Z D=5120, 2025-05-07T20:33:01.9916029Z scale_ub=1200.0, 2025-05-07T20:33:01.9916248Z contiguous=True, 2025-05-07T20:33:01.9916460Z compiled=False, 2025-05-07T20:33:01.9916670Z ) 2025-05-07T20:33:02.2191202Z self = 2025-05-07T20:33:02.2191731Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:02.2192125Z 2025-05-07T20:33:02.2192203Z @given( 2025-05-07T20:33:02.2192499Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.2192889Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.2193272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.2193607Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.2193937Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.2194225Z ) 2025-05-07T20:33:02.2194579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.2195112Z def test_silu_mul_quant( 2025-05-07T20:33:02.2195354Z self, 2025-05-07T20:33:02.2195548Z T: int, 2025-05-07T20:33:02.2195744Z D: int, 2025-05-07T20:33:02.2195963Z scale_ub: Optional[float], 2025-05-07T20:33:02.2196235Z contiguous: bool, 2025-05-07T20:33:02.2196475Z compiled: bool, 2025-05-07T20:33:02.2196706Z ) -> None: 2025-05-07T20:33:02.2196918Z torch.manual_seed(2025) 2025-05-07T20:33:02.2197162Z 2025-05-07T20:33:02.2197432Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.2197772Z 2025-05-07T20:33:02.2197967Z x_sign = torch.sign(x) 2025-05-07T20:33:02.2198285Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.2198596Z x = x_sign * x_clamp 2025-05-07T20:33:02.2198841Z x0 = x[:, :D] 2025-05-07T20:33:02.2199056Z x1 = x[:, D:] 2025-05-07T20:33:02.2199263Z 2025-05-07T20:33:02.2199519Z if contiguous: 2025-05-07T20:33:02.2199747Z x0 = x0.contiguous() 2025-05-07T20:33:02.2200008Z x1 = x1.contiguous() 2025-05-07T20:33:02.2200249Z 2025-05-07T20:33:02.2200435Z if scale_ub is not None: 2025-05-07T20:33:02.2200707Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.2201050Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.2201367Z ) 2025-05-07T20:33:02.2201563Z else: 2025-05-07T20:33:02.2201771Z scale_ub_tensor = None 2025-05-07T20:33:02.2202023Z 2025-05-07T20:33:02.2202259Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.2202574Z op = silu_mul_quant 2025-05-07T20:33:02.2202823Z if compiled: 2025-05-07T20:33:02.2203078Z op = torch.compile(op) 2025-05-07T20:33:02.2203371Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.2203646Z 2025-05-07T20:33:02.2204021Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.2204192Z 2025-05-07T20:33:02.2204295Z moe/activation_test.py:117: 2025-05-07T20:33:02.2204597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.2204931Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.2205210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.2205903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.2206598Z 
Hypothesis then tried further examples, each failing at fn() with the same CompilationError from _fbgemm_silu_mul_quant:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: fp8e4nv not supported
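The sweep shows the failure is independent of T, D, scale_ub, contiguity, and torch.compile: the error is raised while lowering the kernel, before any data is touched. A standalone probe in the same spirit (hypothetical, assuming a recent Triton with tl.float8e4nv and a PyTorch with torch.float8_e4m3fn) should reproduce the identical ValueError on this GPU:

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _fp8e4nv_probe(y_ptr, BLOCK: tl.constexpr):
        x = tl.zeros([BLOCK], dtype=tl.float32)
        # This cast is what trips the architecture check during make_ir.
        y = x.to(tl.float8e4nv)
        tl.store(y_ptr + tl.arange(0, BLOCK), y)


    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    # Expected on SM 8.6: triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    _fp8e4nv_probe[(1,)](y, BLOCK=16)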
Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

For this example fn() itself returned, and the failure moved to the reference path instead, where triton_quantize_fp8_row compiles the _kernel_quantize_fp8_row Triton kernel:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
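The reference path fails the same way because triton_quantize_fp8_row launches a Triton kernel and hits the same architecture check through the autotuner. For context, a rough pure-PyTorch sketch of the row-wise quantization the reference performs, assuming torch.float8_e4m3fn is available; the helper name and the exact clamping details are illustrative, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row symmetric scaling; dequantization is
        # x_fp8.to(torch.float32) * scale[:, None], matching how the test
        # consumes (y_fp8, y_scale).
        row_max = x.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return x_fp8, scale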
2025-05-07T20:33:02.8774001Z op = torch.compile(op) 2025-05-07T20:33:02.8774284Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8774554Z 2025-05-07T20:33:02.8774738Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.8774900Z 2025-05-07T20:33:02.8775001Z moe/activation_test.py:117: 2025-05-07T20:33:02.8775290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8775620Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.8775896Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8776442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.8776993Z return fn(*args, **kwargs) 2025-05-07T20:33:02.8777650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.8778339Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.8778864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.8779539Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.8780191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.8780772Z kernel = self.compile( 2025-05-07T20:33:02.8781311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.8781966Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.8782356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8782587Z 2025-05-07T20:33:02.8782799Z self = 2025-05-07T20:33:02.8783874Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.8785301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917e44b80>} 2025-05-07T20:33:02.8786679Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.8787701Z context = 2025-05-07T20:33:02.8787980Z 2025-05-07T20:33:02.8788194Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.8788706Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.8789163Z module_map=module_map) 2025-05-07T20:33:02.8789525Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.8789946Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.8790205Z E ^ 2025-05-07T20:33:02.8790675Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.8791177Z 2025-05-07T20:33:02.8791597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.8792104Z 2025-05-07T20:33:02.8792201Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.8792607Z self=, 2025-05-07T20:33:02.8793009Z T=1, 2025-05-07T20:33:02.8793227Z D=5120, 2025-05-07T20:33:02.8793408Z scale_ub=1200.0, 2025-05-07T20:33:02.8793625Z contiguous=False, 2025-05-07T20:33:02.8793837Z compiled=False, 2025-05-07T20:33:02.8794034Z ) 2025-05-07T20:33:02.8794338Z self = 2025-05-07T20:33:02.8794817Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:02.8795083Z 2025-05-07T20:33:02.8795157Z @given( 2025-05-07T20:33:02.8795374Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.8795676Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.8795974Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.8796296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.8796612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.8796883Z ) 2025-05-07T20:33:02.8797227Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.8797660Z def test_silu_mul_quant( 2025-05-07T20:33:02.8797891Z self, 2025-05-07T20:33:02.8798072Z T: int, 2025-05-07T20:33:02.8798256Z D: int, 2025-05-07T20:33:02.8798471Z scale_ub: Optional[float], 2025-05-07T20:33:02.8798726Z contiguous: bool, 2025-05-07T20:33:02.8798953Z compiled: bool, 2025-05-07T20:33:02.8799169Z ) -> None: 2025-05-07T20:33:02.8799374Z torch.manual_seed(2025) 2025-05-07T20:33:02.8799613Z 2025-05-07T20:33:02.8799877Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.8800211Z 2025-05-07T20:33:02.8800416Z x_sign = torch.sign(x) 2025-05-07T20:33:02.8800735Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.8801030Z x = x_sign * x_clamp 2025-05-07T20:33:02.8801265Z x0 = x[:, :D] 2025-05-07T20:33:02.8801475Z x1 = x[:, D:] 2025-05-07T20:33:02.8801666Z 2025-05-07T20:33:02.8801851Z if contiguous: 2025-05-07T20:33:02.8802074Z x0 = x0.contiguous() 2025-05-07T20:33:02.8802320Z x1 = x1.contiguous() 2025-05-07T20:33:02.8802555Z 2025-05-07T20:33:02.8802733Z if scale_ub is not None: 2025-05-07T20:33:02.8803009Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.8803329Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.8803678Z ) 2025-05-07T20:33:02.8804047Z else: 2025-05-07T20:33:02.8804242Z scale_ub_tensor = None 2025-05-07T20:33:02.8804487Z 2025-05-07T20:33:02.8804710Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.8805080Z op = silu_mul_quant 2025-05-07T20:33:02.8805330Z if compiled: 2025-05-07T20:33:02.8805566Z op = torch.compile(op) 2025-05-07T20:33:02.8805856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8806115Z 2025-05-07T20:33:02.8806294Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.8806522Z 2025-05-07T20:33:02.8806622Z moe/activation_test.py:117: 2025-05-07T20:33:02.8806906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8807228Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.8807501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8808183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.8808869Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.8809406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.8810078Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.8810777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.8811303Z kernel = self.compile( 2025-05-07T20:33:02.8811904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.8812562Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.8812943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8813169Z 2025-05-07T20:33:02.8813374Z self = 2025-05-07T20:33:02.8814450Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.8815823Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917927550>} 2025-05-07T20:33:02.8817161Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.8818177Z context = 2025-05-07T20:33:02.8818461Z 2025-05-07T20:33:02.8818621Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.8819143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.8819598Z module_map=module_map) 2025-05-07T20:33:02.8819954Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.8820300Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.8820546Z E ^ 2025-05-07T20:33:02.8821053Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.8821512Z 2025-05-07T20:33:02.8821928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.8822439Z 2025-05-07T20:33:02.8822542Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.8822947Z self=, 2025-05-07T20:33:02.8823345Z T=16384, 2025-05-07T20:33:02.8823528Z D=5120, 2025-05-07T20:33:02.8823775Z scale_ub=1200.0, 2025-05-07T20:33:02.8823990Z contiguous=False, 2025-05-07T20:33:02.8824213Z compiled=True, 2025-05-07T20:33:02.8824402Z ) 2025-05-07T20:33:03.0004240Z self = 2025-05-07T20:33:03.0005079Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.0005515Z 2025-05-07T20:33:03.0005623Z @given( 2025-05-07T20:33:03.0005931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.0006363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.0006855Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.0007312Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.0007645Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.0007925Z ) 2025-05-07T20:33:03.0008276Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.0008723Z def test_silu_mul_quant( 2025-05-07T20:33:03.0008964Z self, 2025-05-07T20:33:03.0009157Z T: int, 2025-05-07T20:33:03.0009348Z D: int, 2025-05-07T20:33:03.0009566Z scale_ub: Optional[float], 2025-05-07T20:33:03.0009833Z contiguous: bool, 2025-05-07T20:33:03.0010079Z compiled: bool, 2025-05-07T20:33:03.0010307Z ) -> None: 2025-05-07T20:33:03.0010543Z torch.manual_seed(2025) 2025-05-07T20:33:03.0010812Z 2025-05-07T20:33:03.0011084Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.0011426Z 2025-05-07T20:33:03.0011705Z x_sign = torch.sign(x) 2025-05-07T20:33:03.0012002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.0012308Z x = x_sign * x_clamp 2025-05-07T20:33:03.0012546Z x0 = x[:, :D] 2025-05-07T20:33:03.0012764Z x1 = x[:, D:] 2025-05-07T20:33:03.0012967Z 2025-05-07T20:33:03.0013152Z if contiguous: 2025-05-07T20:33:03.0013385Z x0 = x0.contiguous() 2025-05-07T20:33:03.0013642Z x1 = x1.contiguous() 2025-05-07T20:33:03.0013886Z 2025-05-07T20:33:03.0014080Z if scale_ub is not None: 2025-05-07T20:33:03.0014354Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.0014688Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.0014997Z ) 2025-05-07T20:33:03.0015186Z else: 2025-05-07T20:33:03.0015393Z scale_ub_tensor = None 2025-05-07T20:33:03.0015645Z 2025-05-07T20:33:03.0015875Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.0016192Z op = silu_mul_quant 2025-05-07T20:33:03.0016444Z if compiled: 2025-05-07T20:33:03.0016691Z op = torch.compile(op) 2025-05-07T20:33:03.0016985Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0017260Z 2025-05-07T20:33:03.0017452Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.0017620Z 2025-05-07T20:33:03.0017721Z moe/activation_test.py:117: 2025-05-07T20:33:03.0018017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0018347Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.0018628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0019187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.0019748Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.0020407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.0021146Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.0021686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.0022366Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.0023021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.0023616Z kernel = self.compile( 2025-05-07T20:33:03.0024152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.0024839Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.0025235Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0025461Z 2025-05-07T20:33:03.0025668Z self = 2025-05-07T20:33:03.0026820Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.0028200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89180401f0>} 2025-05-07T20:33:03.0029548Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.0030643Z context = 2025-05-07T20:33:03.0030932Z 2025-05-07T20:33:03.0031095Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.0031665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.0032141Z module_map=module_map) 2025-05-07T20:33:03.0032506Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.0032860Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.0033121Z E ^ 2025-05-07T20:33:03.0033612Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.0034068Z 2025-05-07T20:33:03.0034492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.0040882Z 2025-05-07T20:33:03.0041014Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.0041438Z self=, 2025-05-07T20:33:03.0041835Z T=2048, 2025-05-07T20:33:03.0042028Z D=7168, 2025-05-07T20:33:03.0042217Z scale_ub=1200.0, 2025-05-07T20:33:03.0042437Z contiguous=False, 2025-05-07T20:33:03.0042660Z compiled=True, 2025-05-07T20:33:03.0042858Z ) 2025-05-07T20:33:03.0043174Z self = 2025-05-07T20:33:03.0043665Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.0043940Z 2025-05-07T20:33:03.0044018Z @given( 2025-05-07T20:33:03.0044247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.0044555Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.0044862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.0045189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.0045513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.0045800Z ) 2025-05-07T20:33:03.0046151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.0046583Z def test_silu_mul_quant( 2025-05-07T20:33:03.0046821Z self, 2025-05-07T20:33:03.0047019Z T: int, 2025-05-07T20:33:03.0047209Z D: int, 2025-05-07T20:33:03.0047426Z scale_ub: Optional[float], 2025-05-07T20:33:03.0047693Z contiguous: bool, 2025-05-07T20:33:03.0047930Z compiled: bool, 2025-05-07T20:33:03.0048151Z ) -> None: 2025-05-07T20:33:03.0048363Z torch.manual_seed(2025) 2025-05-07T20:33:03.0048679Z 2025-05-07T20:33:03.0048944Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.0049286Z 2025-05-07T20:33:03.0049485Z x_sign = torch.sign(x) 2025-05-07T20:33:03.0049776Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.0050129Z x = x_sign * x_clamp 2025-05-07T20:33:03.0050367Z x0 = x[:, :D] 2025-05-07T20:33:03.0050578Z x1 = x[:, D:] 2025-05-07T20:33:03.0050803Z 2025-05-07T20:33:03.0051012Z if contiguous: 2025-05-07T20:33:03.0051236Z x0 = x0.contiguous() 2025-05-07T20:33:03.0051495Z x1 = x1.contiguous() 2025-05-07T20:33:03.0051775Z 2025-05-07T20:33:03.0051957Z if scale_ub is not None: 2025-05-07T20:33:03.0052224Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.0052555Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.0052857Z ) 2025-05-07T20:33:03.0053051Z else: 2025-05-07T20:33:03.0053259Z scale_ub_tensor = None 2025-05-07T20:33:03.0053505Z 2025-05-07T20:33:03.0053726Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.0054038Z op = silu_mul_quant 2025-05-07T20:33:03.0054284Z if compiled: 2025-05-07T20:33:03.0054529Z op = torch.compile(op) 2025-05-07T20:33:03.0054824Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0055092Z 2025-05-07T20:33:03.0055274Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.0055437Z 2025-05-07T20:33:03.0055535Z moe/activation_test.py:117: 2025-05-07T20:33:03.0055871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0056196Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.0056471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0057023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.0057573Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.0058231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.0058913Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.0059447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.0060120Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.0060825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.0061357Z kernel = self.compile( 2025-05-07T20:33:03.0061891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.0062548Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.0062934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0063165Z 2025-05-07T20:33:03.0063369Z self = 2025-05-07T20:33:03.0064462Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.0065849Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8918040ee0>} 2025-05-07T20:33:03.0067202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.0068235Z context = 2025-05-07T20:33:03.0068525Z 2025-05-07T20:33:03.0068744Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.0069270Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.0069733Z module_map=module_map) 2025-05-07T20:33:03.0070211Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.0070595Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.0070874Z E ^ 2025-05-07T20:33:03.0071345Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.0071841Z 2025-05-07T20:33:03.0072256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.0072767Z 2025-05-07T20:33:03.2731105Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.2731727Z self=, 2025-05-07T20:33:03.2732309Z T=1, 2025-05-07T20:33:03.2732553Z D=5120, 2025-05-07T20:33:03.2732805Z scale_ub=None, 2025-05-07T20:33:03.2733077Z contiguous=False, 2025-05-07T20:33:03.2733293Z compiled=False, 2025-05-07T20:33:03.2733490Z ) 2025-05-07T20:33:03.2733800Z self = 2025-05-07T20:33:03.2734283Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:03.2734546Z 2025-05-07T20:33:03.2734622Z @given( 2025-05-07T20:33:03.2734842Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.2735273Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.2735579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.2735902Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.2736253Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.2736524Z ) 2025-05-07T20:33:03.2736863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.2737301Z def test_silu_mul_quant( 2025-05-07T20:33:03.2737535Z self, 2025-05-07T20:33:03.2737713Z T: int, 2025-05-07T20:33:03.2737900Z D: int, 2025-05-07T20:33:03.2738111Z scale_ub: Optional[float], 2025-05-07T20:33:03.2738371Z contiguous: bool, 2025-05-07T20:33:03.2738604Z compiled: bool, 2025-05-07T20:33:03.2738818Z ) -> None: 2025-05-07T20:33:03.2739025Z torch.manual_seed(2025) 2025-05-07T20:33:03.2739263Z 2025-05-07T20:33:03.2739532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.2739869Z 2025-05-07T20:33:03.2740055Z x_sign = torch.sign(x) 2025-05-07T20:33:03.2740347Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.2740644Z x = x_sign * x_clamp 2025-05-07T20:33:03.2740916Z x0 = x[:, :D] 2025-05-07T20:33:03.2741131Z x1 = x[:, D:] 2025-05-07T20:33:03.2741326Z 2025-05-07T20:33:03.2741507Z if contiguous: 2025-05-07T20:33:03.2741732Z x0 = x0.contiguous() 2025-05-07T20:33:03.2741977Z x1 = x1.contiguous() 2025-05-07T20:33:03.2742210Z 2025-05-07T20:33:03.2742389Z if scale_ub is not None: 2025-05-07T20:33:03.2742656Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.2742983Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.2743286Z ) 2025-05-07T20:33:03.2743472Z else: 2025-05-07T20:33:03.2743669Z scale_ub_tensor = None 2025-05-07T20:33:03.2743911Z 2025-05-07T20:33:03.2744148Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.2744456Z op = silu_mul_quant 2025-05-07T20:33:03.2744707Z if compiled: 2025-05-07T20:33:03.2744947Z op = torch.compile(op) 2025-05-07T20:33:03.2745233Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.2745508Z 2025-05-07T20:33:03.2745692Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.2745923Z 2025-05-07T20:33:03.2746017Z moe/activation_test.py:117: 2025-05-07T20:33:03.2746307Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2746629Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.2746962Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.2747645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.2748332Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.2748870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.2749602Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.2750339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.2750914Z kernel = self.compile( 2025-05-07T20:33:03.2751458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.2752094Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.2752485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2752707Z 2025-05-07T20:33:03.2752914Z self = 2025-05-07T20:33:03.2754041Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.2755415Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89178db5e0>} 2025-05-07T20:33:03.2756751Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.2757765Z context = 2025-05-07T20:33:03.2758047Z 2025-05-07T20:33:03.2758212Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.2758720Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.2759181Z module_map=module_map) 2025-05-07T20:33:03.2759543Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.2759890Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.2760136Z E ^ 2025-05-07T20:33:03.2760597Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.2761046Z 2025-05-07T20:33:03.2761462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.2761973Z 2025-05-07T20:33:03.2762073Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.2762474Z self=, 2025-05-07T20:33:03.2762868Z T=4096, 2025-05-07T20:33:03.2763044Z D=7168, 2025-05-07T20:33:03.2763227Z scale_ub=1200.0, 2025-05-07T20:33:03.2763444Z contiguous=False, 2025-05-07T20:33:03.2763660Z compiled=False, 2025-05-07T20:33:03.2763849Z ) 2025-05-07T20:33:03.2764159Z self = 2025-05-07T20:33:03.2764648Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:03.2764919Z 2025-05-07T20:33:03.2764992Z @given( 2025-05-07T20:33:03.2765214Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.2765517Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.2765866Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.2766183Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.2766504Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.2766782Z ) 2025-05-07T20:33:03.2767157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.2767587Z def test_silu_mul_quant( 2025-05-07T20:33:03.2767823Z self, 2025-05-07T20:33:03.2768002Z T: int, 2025-05-07T20:33:03.2768194Z D: int, 2025-05-07T20:33:03.2768408Z scale_ub: Optional[float], 2025-05-07T20:33:03.2768733Z contiguous: bool, 2025-05-07T20:33:03.2768969Z compiled: bool, 2025-05-07T20:33:03.2769186Z ) -> None: 2025-05-07T20:33:03.2769395Z torch.manual_seed(2025) 2025-05-07T20:33:03.2769626Z 2025-05-07T20:33:03.2769889Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.2770227Z 2025-05-07T20:33:03.2770417Z x_sign = torch.sign(x) 2025-05-07T20:33:03.2770729Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.2771043Z x = x_sign * x_clamp 2025-05-07T20:33:03.2771272Z x0 = x[:, :D] 2025-05-07T20:33:03.2771477Z x1 = x[:, D:] 2025-05-07T20:33:03.2771678Z 2025-05-07T20:33:03.2771852Z if contiguous: 2025-05-07T20:33:03.2772072Z x0 = x0.contiguous() 2025-05-07T20:33:03.2772327Z x1 = x1.contiguous() 2025-05-07T20:33:03.2772558Z 2025-05-07T20:33:03.2772736Z if scale_ub is not None: 2025-05-07T20:33:03.2773048Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.2773375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.2773668Z ) 2025-05-07T20:33:03.2773857Z else: 2025-05-07T20:33:03.2774061Z scale_ub_tensor = None 2025-05-07T20:33:03.2774297Z 2025-05-07T20:33:03.2774523Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.2774828Z op = silu_mul_quant 2025-05-07T20:33:03.2775066Z if compiled: 2025-05-07T20:33:03.2775299Z op = torch.compile(op) 2025-05-07T20:33:03.2775595Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.2775862Z 2025-05-07T20:33:03.2776038Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.2776202Z 2025-05-07T20:33:03.2776298Z moe/activation_test.py:117: 2025-05-07T20:33:03.2776585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2776901Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.2777182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.2777869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.2778551Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.2779075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.2779748Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.2780398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.2780920Z kernel = self.compile( 2025-05-07T20:33:03.2781450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.2782092Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.2782480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2782706Z 2025-05-07T20:33:03.2782905Z self = 2025-05-07T20:33:03.2783981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.2785402Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917f8b1f0>} 2025-05-07T20:33:03.2786788Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.2787813Z context = 2025-05-07T20:33:03.2788137Z 2025-05-07T20:33:03.2788302Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.2788817Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.2789284Z module_map=module_map) 2025-05-07T20:33:03.2789643Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.2790036Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.2790286Z E ^ 2025-05-07T20:33:03.2790741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.2791197Z 2025-05-07T20:33:03.2791614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.2792133Z 2025-05-07T20:33:03.2792234Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.2792679Z self=, 2025-05-07T20:33:03.2793083Z T=16384, 2025-05-07T20:33:03.2793264Z D=7168, 2025-05-07T20:33:03.2793449Z scale_ub=None, 2025-05-07T20:33:03.2793649Z contiguous=True, 2025-05-07T20:33:03.2793864Z compiled=True, 2025-05-07T20:33:03.2794060Z ) 2025-05-07T20:33:03.5626025Z self = 2025-05-07T20:33:03.5626802Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:03.5627179Z 2025-05-07T20:33:03.5627279Z @given( 2025-05-07T20:33:03.5627582Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5627954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5628289Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5628654Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5629019Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5629330Z ) 2025-05-07T20:33:03.5629690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5630189Z def test_silu_mul_quant( 2025-05-07T20:33:03.5630423Z self, 2025-05-07T20:33:03.5630624Z T: int, 2025-05-07T20:33:03.5630852Z D: int, 2025-05-07T20:33:03.5631071Z scale_ub: Optional[float], 2025-05-07T20:33:03.5631341Z contiguous: bool, 2025-05-07T20:33:03.5631580Z compiled: bool, 2025-05-07T20:33:03.5631797Z ) -> None: 2025-05-07T20:33:03.5632008Z torch.manual_seed(2025) 2025-05-07T20:33:03.5632244Z 2025-05-07T20:33:03.5632507Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5632850Z 2025-05-07T20:33:03.5633037Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5633322Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5633621Z x = x_sign * x_clamp 2025-05-07T20:33:03.5633856Z x0 = x[:, :D] 2025-05-07T20:33:03.5634064Z x1 = x[:, D:] 2025-05-07T20:33:03.5634270Z 2025-05-07T20:33:03.5634451Z if contiguous: 2025-05-07T20:33:03.5634675Z x0 = x0.contiguous() 2025-05-07T20:33:03.5634925Z x1 = x1.contiguous() 2025-05-07T20:33:03.5635162Z 2025-05-07T20:33:03.5635349Z if scale_ub is not None: 2025-05-07T20:33:03.5635615Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5636085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5636395Z ) 2025-05-07T20:33:03.5636577Z else: 2025-05-07T20:33:03.5636789Z scale_ub_tensor = None 2025-05-07T20:33:03.5637039Z 2025-05-07T20:33:03.5637332Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5637645Z op = silu_mul_quant 2025-05-07T20:33:03.5637901Z if compiled: 2025-05-07T20:33:03.5638155Z op = torch.compile(op) 2025-05-07T20:33:03.5638448Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5638789Z 2025-05-07T20:33:03.5638981Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5639149Z 2025-05-07T20:33:03.5639248Z moe/activation_test.py:117: 2025-05-07T20:33:03.5639539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5639873Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5640151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5640704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5641303Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.5641969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5642643Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5643174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5643912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5644563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5645091Z kernel = self.compile( 2025-05-07T20:33:03.5645631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5646285Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5646666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5646899Z 2025-05-07T20:33:03.5647105Z self = 2025-05-07T20:33:03.5648177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5649562Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917f8bee0>} 2025-05-07T20:33:03.5650897Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5651906Z context = 2025-05-07T20:33:03.5652192Z 2025-05-07T20:33:03.5652359Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5652880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5653341Z module_map=module_map) 2025-05-07T20:33:03.5653696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5654043Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5654303Z E ^ 2025-05-07T20:33:03.5654764Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5655216Z 2025-05-07T20:33:03.5655628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5656185Z
The next ten Hypothesis examples failed identically -- same test body, same traceback into fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (via torch/_dynamo/eval_frame.py when compiled=True) and down through triton/runtime/jit.py into triton/compiler/compiler.py, and the same CompilationError at line 1:0 of _fbgemm_silu_mul_quant. Only the sampled parameters differ, so the duplicate tracebacks are collapsed to one line per example:
2025-05-07T20:33:03.5656286Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:03.7639379Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:33:03.7671257Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:04.0467313Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:33:04.0498998Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:33:04.0530168Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:04.2464648Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:33:04.2495455Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:04.5397182Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:04.7487783Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
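Why every example fails: the ValueError itself pins down the root cause. The kernel asks Triton for the fp8e4nv (FP8 E4M3) element type, but Triton only lowers fp8e4nv on GPUs of compute capability 8.9 and newer (Ada/Hopper); a GPU that reports only fp8e4b15 and fp8e5, as here, is an SM 8.6-class part such as the NVIDIA A10G. A minimal probe for this condition, as a sketch -- the helper name is ours, not FBGEMM's:

import torch

def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) needs an SM 8.9+ GPU (e.g. L4, H100). SM 8.6 parts
    # such as the A10G expose only fp8e4b15/fp8e5, matching the error above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)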
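Note where the error escapes: src.make_ir -> ast_to_ttir, i.e. the kernel dies while Triton is still building its IR, before a single tensor element is touched. That is why T, D, scale_ub, contiguous, and compiled have no influence on the outcome. A reduced reproduction, independent of FBGEMM and purely illustrative (kernel name, shapes, and the round-trip cast are our choices, not the real _fbgemm_silu_mul_quant):

import torch
import triton
import triton.language as tl

@triton.jit
def _fp8_roundtrip(x_ptr, y_ptr, N: tl.constexpr):
    offs = tl.arange(0, N)
    x = tl.load(x_ptr + offs)
    # The cast below is what trips compilation on SM < 8.9: fp8e4nv has no
    # lowering there, so ast_to_ttir raises the same ValueError as above.
    y = x.to(tl.float8e4nv).to(tl.float32)
    tl.store(y_ptr + offs, y)

x = torch.randn(16, device="cuda")
y = torch.empty_like(x)
_fp8_roundtrip[(1,)](x, y, N=16)  # raises CompilationError on an SM 8.6 GPU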
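Because the limitation is architectural, no Hypothesis example can pass and shrinking is pointless; the test needs to be gated on hardware, not on inputs. A hedged sketch of one way to do that with stock unittest (class name, decorator placement, and the capability threshold are our assumptions, not the test file's actual code):

import unittest
import torch

def _sm89_or_newer() -> bool:
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class ActivationTests(unittest.TestCase):  # name assumed for illustration
    @unittest.skipUnless(_sm89_or_newer(), "fp8e4nv (E4M3) requires SM 8.9+")
    def test_silu_mul_quant(self) -> None:
        # The existing @given/@settings-decorated body would sit here; the
        # skip fires in TestCase.run, before Hypothesis draws any example.
        pass

if __name__ == "__main__":
    unittest.main()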
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.7525188Z 2025-05-07T20:33:04.7525602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.7526172Z 2025-05-07T20:33:04.7526275Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.7526685Z self=, 2025-05-07T20:33:04.7527090Z T=16384, 2025-05-07T20:33:04.7527272Z D=5120, 2025-05-07T20:33:04.7527461Z scale_ub=1200.0, 2025-05-07T20:33:04.7527717Z contiguous=True, 2025-05-07T20:33:04.7527928Z compiled=True, 2025-05-07T20:33:04.7528128Z ) 2025-05-07T20:33:04.7528442Z self = 2025-05-07T20:33:04.7528927Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:04.7529201Z 2025-05-07T20:33:04.7529272Z @given( 2025-05-07T20:33:04.7529494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7529790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7530092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7530417Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7530744Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7531022Z ) 2025-05-07T20:33:04.7531412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7531845Z def test_silu_mul_quant( 2025-05-07T20:33:04.7532077Z self, 2025-05-07T20:33:04.7532309Z T: int, 2025-05-07T20:33:04.7532502Z D: int, 2025-05-07T20:33:04.7532707Z scale_ub: Optional[float], 2025-05-07T20:33:04.7532968Z contiguous: bool, 2025-05-07T20:33:04.7533204Z compiled: bool, 2025-05-07T20:33:04.7533414Z ) -> None: 2025-05-07T20:33:04.7533625Z torch.manual_seed(2025) 2025-05-07T20:33:04.7533860Z 2025-05-07T20:33:04.7534121Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7534463Z 2025-05-07T20:33:04.7534646Z x_sign = torch.sign(x) 2025-05-07T20:33:04.7534927Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.7535235Z x = x_sign * x_clamp 2025-05-07T20:33:04.7535470Z x0 = x[:, :D] 2025-05-07T20:33:04.7535680Z x1 = x[:, D:] 2025-05-07T20:33:04.7535876Z 2025-05-07T20:33:04.7536053Z if contiguous: 2025-05-07T20:33:04.7536276Z x0 = x0.contiguous() 2025-05-07T20:33:04.7536527Z x1 = x1.contiguous() 2025-05-07T20:33:04.7536759Z 2025-05-07T20:33:04.7536943Z if scale_ub is not None: 2025-05-07T20:33:04.7537203Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.7537531Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.7537831Z ) 2025-05-07T20:33:04.7538012Z else: 2025-05-07T20:33:04.7538222Z scale_ub_tensor = None 2025-05-07T20:33:04.7538469Z 2025-05-07T20:33:04.7538684Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.7538990Z op = silu_mul_quant 2025-05-07T20:33:04.7539232Z if compiled: 2025-05-07T20:33:04.7539469Z op = torch.compile(op) 2025-05-07T20:33:04.7539756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.7540025Z 2025-05-07T20:33:04.7540206Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.7540367Z 2025-05-07T20:33:04.7540464Z moe/activation_test.py:117: 2025-05-07T20:33:04.7540755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.7541081Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.7541349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.7541901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.7542500Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.7543148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.7543826Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.7544390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.7545064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.7545713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.7546273Z kernel = self.compile( 2025-05-07T20:33:04.7546802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.7547448Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.7547830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.7548062Z 2025-05-07T20:33:04.7548265Z self = 2025-05-07T20:33:04.7549352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.7550802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917551550>} 2025-05-07T20:33:04.7552188Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.7553210Z context = 2025-05-07T20:33:04.7553495Z 2025-05-07T20:33:04.7553662Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.7554178Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.7554636Z module_map=module_map) 2025-05-07T20:33:04.7555007Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.7555351Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.7555598Z E ^ 2025-05-07T20:33:04.7556060Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.7556521Z 2025-05-07T20:33:04.7556936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.7557444Z 2025-05-07T20:33:04.9775810Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.9776443Z self=, 2025-05-07T20:33:04.9777003Z T=16384, 2025-05-07T20:33:04.9777271Z D=5120, 2025-05-07T20:33:04.9777487Z scale_ub=None, 2025-05-07T20:33:04.9777702Z contiguous=False, 2025-05-07T20:33:04.9777931Z compiled=True, 2025-05-07T20:33:04.9778135Z ) 2025-05-07T20:33:04.9778452Z self = 2025-05-07T20:33:04.9778950Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:04.9779225Z 2025-05-07T20:33:04.9779307Z @given( 2025-05-07T20:33:04.9779532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.9779852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.9780163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.9780492Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.9780822Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.9781141Z ) 2025-05-07T20:33:04.9781514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.9782067Z def test_silu_mul_quant( 2025-05-07T20:33:04.9782313Z self, 2025-05-07T20:33:04.9782506Z T: int, 2025-05-07T20:33:04.9782699Z D: int, 2025-05-07T20:33:04.9782980Z scale_ub: Optional[float], 2025-05-07T20:33:04.9783255Z contiguous: bool, 2025-05-07T20:33:04.9783494Z compiled: bool, 2025-05-07T20:33:04.9783721Z ) -> None: 2025-05-07T20:33:04.9783936Z torch.manual_seed(2025) 2025-05-07T20:33:04.9784175Z 2025-05-07T20:33:04.9784449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.9784859Z 2025-05-07T20:33:04.9785045Z x_sign = torch.sign(x) 2025-05-07T20:33:04.9785343Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.9785658Z x = x_sign * x_clamp 2025-05-07T20:33:04.9785896Z x0 = x[:, :D] 2025-05-07T20:33:04.9786114Z x1 = x[:, D:] 2025-05-07T20:33:04.9786328Z 2025-05-07T20:33:04.9786509Z if contiguous: 2025-05-07T20:33:04.9786743Z x0 = x0.contiguous() 2025-05-07T20:33:04.9787002Z x1 = x1.contiguous() 2025-05-07T20:33:04.9787243Z 2025-05-07T20:33:04.9787431Z if scale_ub is not None: 2025-05-07T20:33:04.9787709Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.9788046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.9788351Z ) 2025-05-07T20:33:04.9788549Z else: 2025-05-07T20:33:04.9788770Z scale_ub_tensor = None 2025-05-07T20:33:04.9789029Z 2025-05-07T20:33:04.9789337Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.9789653Z op = silu_mul_quant 2025-05-07T20:33:04.9789980Z if compiled: 2025-05-07T20:33:04.9790232Z op = torch.compile(op) 2025-05-07T20:33:04.9790532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.9790801Z 2025-05-07T20:33:04.9790993Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.9791203Z 2025-05-07T20:33:04.9791329Z moe/activation_test.py:117: 2025-05-07T20:33:04.9791633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.9791962Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.9792253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.9792819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.9793371Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.9794044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.9794737Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.9795275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.9795954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.9796621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.9797150Z kernel = self.compile( 2025-05-07T20:33:04.9797691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.9798349Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.9798745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.9798970Z 2025-05-07T20:33:04.9799187Z self = 2025-05-07T20:33:04.9800269Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.9801654Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891764a1f0>} 2025-05-07T20:33:04.9803124Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.9804649Z context = 2025-05-07T20:33:04.9804937Z 2025-05-07T20:33:04.9805109Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.9805709Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.9806180Z module_map=module_map) 2025-05-07T20:33:04.9806547Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.9806895Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.9807150Z E ^ 2025-05-07T20:33:04.9807616Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.9808071Z 2025-05-07T20:33:04.9808498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.9809014Z 2025-05-07T20:33:04.9809115Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.9809529Z self=, 2025-05-07T20:33:04.9809934Z T=2048, 2025-05-07T20:33:04.9810114Z D=5120, 2025-05-07T20:33:04.9810370Z scale_ub=None, 2025-05-07T20:33:04.9810587Z contiguous=False, 2025-05-07T20:33:04.9810807Z compiled=True, 2025-05-07T20:33:04.9811010Z ) 2025-05-07T20:33:05.1020549Z self = 2025-05-07T20:33:05.1021603Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.1022383Z 2025-05-07T20:33:05.1022595Z @given( 2025-05-07T20:33:05.1023185Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.1024033Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.1024673Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.1025323Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.1025973Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.1026530Z ) 2025-05-07T20:33:05.1027209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.1028085Z def test_silu_mul_quant( 2025-05-07T20:33:05.1028559Z self, 2025-05-07T20:33:05.1028938Z T: int, 2025-05-07T20:33:05.1029314Z D: int, 2025-05-07T20:33:05.1029742Z scale_ub: Optional[float], 2025-05-07T20:33:05.1030373Z contiguous: bool, 2025-05-07T20:33:05.1030831Z compiled: bool, 2025-05-07T20:33:05.1031124Z ) -> None: 2025-05-07T20:33:05.1031341Z torch.manual_seed(2025) 2025-05-07T20:33:05.1031572Z 2025-05-07T20:33:05.1031843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.1032182Z 2025-05-07T20:33:05.1032364Z x_sign = torch.sign(x) 2025-05-07T20:33:05.1032663Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.1032968Z x = x_sign * x_clamp 2025-05-07T20:33:05.1033201Z x0 = x[:, :D] 2025-05-07T20:33:05.1033417Z x1 = x[:, D:] 2025-05-07T20:33:05.1033620Z 2025-05-07T20:33:05.1033797Z if contiguous: 2025-05-07T20:33:05.1034028Z x0 = x0.contiguous() 2025-05-07T20:33:05.1034285Z x1 = x1.contiguous() 2025-05-07T20:33:05.1034520Z 2025-05-07T20:33:05.1034703Z if scale_ub is not None: 2025-05-07T20:33:05.1034976Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.1035309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.1035614Z ) 2025-05-07T20:33:05.1035919Z else: 2025-05-07T20:33:05.1036127Z scale_ub_tensor = None 2025-05-07T20:33:05.1036369Z 2025-05-07T20:33:05.1036602Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.1036913Z op = silu_mul_quant 2025-05-07T20:33:05.1037222Z if compiled: 2025-05-07T20:33:05.1037473Z op = torch.compile(op) 2025-05-07T20:33:05.1037766Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.1038033Z 2025-05-07T20:33:05.1038224Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.1038386Z 2025-05-07T20:33:05.1038549Z moe/activation_test.py:117: 2025-05-07T20:33:05.1038840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.1039166Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.1039451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.1040015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.1040573Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.1041233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.1041924Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.1042456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.1043138Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.1043858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.1044392Z kernel = self.compile( 2025-05-07T20:33:05.1044925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.1045583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.1045982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.1046214Z 2025-05-07T20:33:05.1046420Z self = 2025-05-07T20:33:05.1047504Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.1048897Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891764af70>} 2025-05-07T20:33:05.1050250Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.1051273Z context = 2025-05-07T20:33:05.1051560Z 2025-05-07T20:33:05.1051729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.1052246Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.1052715Z module_map=module_map) 2025-05-07T20:33:05.1053087Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.1053432Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.1053688Z E ^ 2025-05-07T20:33:05.1054157Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.1054618Z 2025-05-07T20:33:05.1055042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.1055560Z 2025-05-07T20:33:05.1055665Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.1056081Z self=, 2025-05-07T20:33:05.1056548Z T=2048, 2025-05-07T20:33:05.1056733Z D=5120, 2025-05-07T20:33:05.1056926Z scale_ub=1200.0, 2025-05-07T20:33:05.1057153Z contiguous=False, 2025-05-07T20:33:05.1057368Z compiled=True, 2025-05-07T20:33:05.1057610Z ) 2025-05-07T20:33:05.1057948Z self = 2025-05-07T20:33:05.1058448Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.1058721Z 2025-05-07T20:33:05.1058806Z @given( 2025-05-07T20:33:05.1059030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.1059380Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.1059682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.1060006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.1060332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.1060617Z ) 2025-05-07T20:33:05.1060964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.1061404Z def test_silu_mul_quant( 2025-05-07T20:33:05.1061646Z self, 2025-05-07T20:33:05.1061835Z T: int, 2025-05-07T20:33:05.1062026Z D: int, 2025-05-07T20:33:05.1062247Z scale_ub: Optional[float], 2025-05-07T20:33:05.1062514Z contiguous: bool, 2025-05-07T20:33:05.1062747Z compiled: bool, 2025-05-07T20:33:05.1062970Z ) -> None: 2025-05-07T20:33:05.1063182Z torch.manual_seed(2025) 2025-05-07T20:33:05.1063415Z 2025-05-07T20:33:05.1063745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.1064086Z 2025-05-07T20:33:05.1064274Z x_sign = torch.sign(x) 2025-05-07T20:33:05.1064568Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.1064873Z x = x_sign * x_clamp 2025-05-07T20:33:05.1065114Z x0 = x[:, :D] 2025-05-07T20:33:05.1065337Z x1 = x[:, D:] 2025-05-07T20:33:05.1065539Z 2025-05-07T20:33:05.1065719Z if contiguous: 2025-05-07T20:33:05.1065945Z x0 = x0.contiguous() 2025-05-07T20:33:05.1066203Z x1 = x1.contiguous() 2025-05-07T20:33:05.1066441Z 2025-05-07T20:33:05.1066644Z if scale_ub is not None: 2025-05-07T20:33:05.1066916Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.1067248Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.1067554Z ) 2025-05-07T20:33:05.1067746Z else: 2025-05-07T20:33:05.1067955Z scale_ub_tensor = None 2025-05-07T20:33:05.1068200Z 2025-05-07T20:33:05.1068426Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.1068743Z op = silu_mul_quant 2025-05-07T20:33:05.1068986Z if compiled: 2025-05-07T20:33:05.1069227Z op = torch.compile(op) 2025-05-07T20:33:05.1069520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.1069790Z 2025-05-07T20:33:05.1070022Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.1070186Z 2025-05-07T20:33:05.1070287Z moe/activation_test.py:117: 2025-05-07T20:33:05.1070577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.1070909Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.1071212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.1071789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.1072339Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.1073001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.1073690Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.1074220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.1074957Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.1075616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.1076146Z kernel = self.compile( 2025-05-07T20:33:05.1076713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.1077367Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.1077759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.1078029Z 2025-05-07T20:33:05.1078237Z self = 2025-05-07T20:33:05.1079327Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.1080712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891746e940>} 2025-05-07T20:33:05.1082070Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.1083100Z context = 2025-05-07T20:33:05.1083384Z 2025-05-07T20:33:05.1083615Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.1084145Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.1084613Z module_map=module_map) 2025-05-07T20:33:05.1084990Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.1085343Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.1085600Z E ^ 2025-05-07T20:33:05.1086065Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.1086519Z 2025-05-07T20:33:05.1086953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.1087480Z 2025-05-07T20:33:05.5054716Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.5055311Z self=, 2025-05-07T20:33:05.5055912Z T=4096, 2025-05-07T20:33:05.5056172Z D=5120, 2025-05-07T20:33:05.5056417Z scale_ub=1200.0, 2025-05-07T20:33:05.5056699Z contiguous=True, 2025-05-07T20:33:05.5056916Z compiled=True, 2025-05-07T20:33:05.5057103Z ) 2025-05-07T20:33:05.5057447Z self = 2025-05-07T20:33:05.5057949Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.5058222Z 2025-05-07T20:33:05.5058302Z @given( 2025-05-07T20:33:05.5058525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.5058838Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.5059146Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.5059476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.5059798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.5060082Z ) 2025-05-07T20:33:05.5060434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.5060873Z def test_silu_mul_quant( 2025-05-07T20:33:05.5061135Z self, 2025-05-07T20:33:05.5061351Z T: int, 2025-05-07T20:33:05.5061541Z D: int, 2025-05-07T20:33:05.5061754Z scale_ub: Optional[float], 2025-05-07T20:33:05.5062025Z contiguous: bool, 2025-05-07T20:33:05.5062256Z compiled: bool, 2025-05-07T20:33:05.5062599Z ) -> None: 2025-05-07T20:33:05.5062812Z torch.manual_seed(2025) 2025-05-07T20:33:05.5063047Z 2025-05-07T20:33:05.5063318Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.5063662Z 2025-05-07T20:33:05.5063911Z x_sign = torch.sign(x) 2025-05-07T20:33:05.5064202Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.5064509Z x = x_sign * x_clamp 2025-05-07T20:33:05.5064742Z x0 = x[:, :D] 2025-05-07T20:33:05.5064948Z x1 = x[:, D:] 2025-05-07T20:33:05.5065152Z 2025-05-07T20:33:05.5065409Z if contiguous: 2025-05-07T20:33:05.5065634Z x0 = x0.contiguous() 2025-05-07T20:33:05.5065899Z x1 = x1.contiguous() 2025-05-07T20:33:05.5066143Z 2025-05-07T20:33:05.5066328Z if scale_ub is not None: 2025-05-07T20:33:05.5066599Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.5066933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.5067240Z ) 2025-05-07T20:33:05.5067429Z else: 2025-05-07T20:33:05.5067632Z scale_ub_tensor = None 2025-05-07T20:33:05.5067873Z 2025-05-07T20:33:05.5068097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.5068411Z op = silu_mul_quant 2025-05-07T20:33:05.5068650Z if compiled: 2025-05-07T20:33:05.5068894Z op = torch.compile(op) 2025-05-07T20:33:05.5069187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.5069457Z 2025-05-07T20:33:05.5069642Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.5069969Z 2025-05-07T20:33:05.5070069Z moe/activation_test.py:117: 2025-05-07T20:33:05.5070361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.5070687Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.5070964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.5071560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.5072117Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.5072780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.5073460Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.5073993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.5074670Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.5075337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.5075855Z kernel = self.compile( 2025-05-07T20:33:05.5076393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.5077045Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.5077438Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.5077668Z 2025-05-07T20:33:05.5077871Z self = 2025-05-07T20:33:05.5078959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.5080343Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917409790>} 2025-05-07T20:33:05.5081685Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.5082749Z context = 2025-05-07T20:33:05.5083035Z 2025-05-07T20:33:05.5083198Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.5083754Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.5084222Z module_map=module_map) 2025-05-07T20:33:05.5084582Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.5084937Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.5085195Z E ^ 2025-05-07T20:33:05.5085703Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.5086160Z 2025-05-07T20:33:05.5086576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.5087087Z 2025-05-07T20:33:05.5087188Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.5087596Z self=, 2025-05-07T20:33:05.5087992Z T=128, 2025-05-07T20:33:05.5088173Z D=5120, 2025-05-07T20:33:05.5088362Z scale_ub=1200.0, 2025-05-07T20:33:05.5088579Z contiguous=False, 2025-05-07T20:33:05.5088808Z compiled=True, 2025-05-07T20:33:05.5089010Z ) 2025-05-07T20:33:05.6408038Z self = 2025-05-07T20:33:05.6408838Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.6409212Z 2025-05-07T20:33:05.6409437Z @given( 2025-05-07T20:33:05.6409758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.6410068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.6410375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.6410697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.6411027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.6411353Z ) 2025-05-07T20:33:05.6411691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.6412131Z def test_silu_mul_quant( 2025-05-07T20:33:05.6412370Z self, 2025-05-07T20:33:05.6412560Z T: int, 2025-05-07T20:33:05.6412753Z D: int, 2025-05-07T20:33:05.6412967Z scale_ub: Optional[float], 2025-05-07T20:33:05.6413235Z contiguous: bool, 2025-05-07T20:33:05.6413468Z compiled: bool, 2025-05-07T20:33:05.6413690Z ) -> None: 2025-05-07T20:33:05.6413898Z torch.manual_seed(2025) 2025-05-07T20:33:05.6414146Z 2025-05-07T20:33:05.6414418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.6414754Z 2025-05-07T20:33:05.6414942Z x_sign = torch.sign(x) 2025-05-07T20:33:05.6415235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.6415542Z x = x_sign * x_clamp 2025-05-07T20:33:05.6415778Z x0 = x[:, :D] 2025-05-07T20:33:05.6415988Z x1 = x[:, D:] 2025-05-07T20:33:05.6416195Z 2025-05-07T20:33:05.6416373Z if contiguous: 2025-05-07T20:33:05.6416602Z x0 = x0.contiguous() 2025-05-07T20:33:05.6416856Z x1 = x1.contiguous() 2025-05-07T20:33:05.6417093Z 2025-05-07T20:33:05.6417283Z if scale_ub is not None: 2025-05-07T20:33:05.6417564Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.6417894Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.6418208Z ) 2025-05-07T20:33:05.6418398Z else: 2025-05-07T20:33:05.6418604Z scale_ub_tensor = None 2025-05-07T20:33:05.6418850Z 2025-05-07T20:33:05.6419079Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.6419396Z op = silu_mul_quant 2025-05-07T20:33:05.6419639Z if compiled: 2025-05-07T20:33:05.6419882Z op = torch.compile(op) 2025-05-07T20:33:05.6420252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.6420517Z 2025-05-07T20:33:05.6420707Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.6420874Z 2025-05-07T20:33:05.6420979Z moe/activation_test.py:117: 2025-05-07T20:33:05.6421352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.6421707Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.6421982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.6422533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.6423146Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.6423804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.6424492Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.6425018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.6425701Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.6426360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.6426891Z kernel = self.compile( 2025-05-07T20:33:05.6427428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.6428078Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.6428518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.6428748Z 2025-05-07T20:33:05.6428954Z self = 2025-05-07T20:33:05.6430114Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.6431505Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89172fe0d0>} 2025-05-07T20:33:05.6432852Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.6433869Z context = 2025-05-07T20:33:05.6434159Z 2025-05-07T20:33:05.6434327Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.6434844Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.6435315Z module_map=module_map) 2025-05-07T20:33:05.6435681Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.6436028Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.6436286Z E ^ 2025-05-07T20:33:05.6436757Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.6437206Z 2025-05-07T20:33:05.6437632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.6438155Z 2025-05-07T20:33:05.6438258Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.6438670Z self=, 2025-05-07T20:33:05.6439076Z T=16384, 2025-05-07T20:33:05.6439262Z D=7168, 2025-05-07T20:33:05.6439454Z scale_ub=1200.0, 2025-05-07T20:33:05.6439671Z contiguous=True, 2025-05-07T20:33:05.6439891Z compiled=True, 2025-05-07T20:33:05.6440094Z ) 2025-05-07T20:33:05.6440409Z self = 2025-05-07T20:33:05.6440948Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.6441227Z 2025-05-07T20:33:05.6441302Z @given( 2025-05-07T20:33:05.6441532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.6441877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.6442181Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.6442513Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.6442844Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.6443130Z ) 2025-05-07T20:33:05.6443537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.6443976Z def test_silu_mul_quant( 2025-05-07T20:33:05.6444224Z self, 2025-05-07T20:33:05.6444410Z T: int, 2025-05-07T20:33:05.6444603Z D: int, 2025-05-07T20:33:05.6444817Z scale_ub: Optional[float], 2025-05-07T20:33:05.6445082Z contiguous: bool, 2025-05-07T20:33:05.6445327Z compiled: bool, 2025-05-07T20:33:05.6445542Z ) -> None: 2025-05-07T20:33:05.6445751Z torch.manual_seed(2025) 2025-05-07T20:33:05.6445991Z 2025-05-07T20:33:05.6446259Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.6446596Z 2025-05-07T20:33:05.6446786Z x_sign = torch.sign(x) 2025-05-07T20:33:05.6447067Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.6447372Z x = x_sign * x_clamp 2025-05-07T20:33:05.6447603Z x0 = x[:, :D] 2025-05-07T20:33:05.6447813Z x1 = x[:, D:] 2025-05-07T20:33:05.6448063Z 2025-05-07T20:33:05.6448233Z if contiguous: 2025-05-07T20:33:05.6448458Z x0 = x0.contiguous() 2025-05-07T20:33:05.6448707Z x1 = x1.contiguous() 2025-05-07T20:33:05.6448935Z 2025-05-07T20:33:05.6449112Z if scale_ub is not None: 2025-05-07T20:33:05.6449385Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.6449709Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.6450009Z ) 2025-05-07T20:33:05.6450193Z else: 2025-05-07T20:33:05.6450388Z scale_ub_tensor = None 2025-05-07T20:33:05.6450629Z 2025-05-07T20:33:05.6450853Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.6451155Z op = silu_mul_quant 2025-05-07T20:33:05.6451398Z if compiled: 2025-05-07T20:33:05.6451635Z op = torch.compile(op) 2025-05-07T20:33:05.6451920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.6452187Z 2025-05-07T20:33:05.6452377Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.6452538Z 2025-05-07T20:33:05.6452636Z moe/activation_test.py:117: 2025-05-07T20:33:05.6452919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.6453239Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.6453511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.6454059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.6454611Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.6455268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.6455949Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.6456473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.6457152Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.6457810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.6458329Z kernel = self.compile( 2025-05-07T20:33:05.6458862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.6459552Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.6459939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.6460160Z 2025-05-07T20:33:05.6460400Z self = 2025-05-07T20:33:05.6461536Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.6462943Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89172fed30>} 2025-05-07T20:33:05.6464281Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.6465296Z context = 2025-05-07T20:33:05.6465577Z 2025-05-07T20:33:05.6465740Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.6466261Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.6466721Z module_map=module_map) 2025-05-07T20:33:05.6467084Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.6467434Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.6467730Z E ^ 2025-05-07T20:33:05.6468193Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.6468650Z 2025-05-07T20:33:05.6469068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.6469582Z 2025-05-07T20:33:05.9231686Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.9232296Z self=, 2025-05-07T20:33:05.9232843Z T=16384, 2025-05-07T20:33:05.9233091Z D=5120, 2025-05-07T20:33:05.9233341Z scale_ub=1200.0, 2025-05-07T20:33:05.9233649Z contiguous=True, 2025-05-07T20:33:05.9233873Z compiled=False, 2025-05-07T20:33:05.9234079Z ) 2025-05-07T20:33:05.9234396Z self = 2025-05-07T20:33:05.9234887Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.9235175Z 2025-05-07T20:33:05.9235251Z @given( 2025-05-07T20:33:05.9235476Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.9235788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.9236127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.9236447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.9236809Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.9237213Z ) 2025-05-07T20:33:05.9237691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.9238248Z def test_silu_mul_quant( 2025-05-07T20:33:05.9238492Z self, 2025-05-07T20:33:05.9238686Z T: int, 2025-05-07T20:33:05.9238873Z D: int, 2025-05-07T20:33:05.9239086Z scale_ub: Optional[float], 2025-05-07T20:33:05.9239355Z contiguous: bool, 2025-05-07T20:33:05.9239585Z compiled: bool, 2025-05-07T20:33:05.9239802Z ) -> None: 2025-05-07T20:33:05.9240018Z torch.manual_seed(2025) 2025-05-07T20:33:05.9240253Z 2025-05-07T20:33:05.9240518Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.9240852Z 2025-05-07T20:33:05.9241035Z x_sign = torch.sign(x) 2025-05-07T20:33:05.9241323Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.9241802Z x = x_sign * x_clamp 2025-05-07T20:33:05.9242038Z x0 = x[:, :D] 2025-05-07T20:33:05.9242282Z x1 = x[:, D:] 2025-05-07T20:33:05.9242489Z 2025-05-07T20:33:05.9242675Z if contiguous: 2025-05-07T20:33:05.9242906Z x0 = x0.contiguous() 2025-05-07T20:33:05.9243224Z x1 = x1.contiguous() 2025-05-07T20:33:05.9243468Z 2025-05-07T20:33:05.9243653Z if scale_ub is not None: 2025-05-07T20:33:05.9243916Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.9244245Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.9244617Z ) 2025-05-07T20:33:05.9244808Z else: 2025-05-07T20:33:05.9245006Z scale_ub_tensor = None 2025-05-07T20:33:05.9245253Z 2025-05-07T20:33:05.9245482Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.9245787Z op = silu_mul_quant 2025-05-07T20:33:05.9246031Z if compiled: 2025-05-07T20:33:05.9246276Z op = torch.compile(op) 2025-05-07T20:33:05.9246564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.9246831Z 2025-05-07T20:33:05.9247019Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.9247183Z 2025-05-07T20:33:05.9247280Z moe/activation_test.py:117: 2025-05-07T20:33:05.9247573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.9247906Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.9248186Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.9248941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.9249634Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.9250173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.9250850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.9251511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.9252038Z kernel = self.compile( 2025-05-07T20:33:05.9252577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.9253217Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.9253608Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.9253835Z 2025-05-07T20:33:05.9254046Z self = 2025-05-07T20:33:05.9255141Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.9256516Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891725c700>} 2025-05-07T20:33:05.9257871Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.9258891Z context = 2025-05-07T20:33:05.9259175Z 2025-05-07T20:33:05.9259351Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.9259872Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.9260337Z module_map=module_map) 2025-05-07T20:33:05.9260696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.9261046Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.9261296Z E ^ 2025-05-07T20:33:05.9261868Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.9262317Z 2025-05-07T20:33:05.9262779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.9263291Z 2025-05-07T20:33:05.9263395Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.9263806Z self=, 2025-05-07T20:33:05.9264210Z T=1, 2025-05-07T20:33:05.9264385Z D=7168, 2025-05-07T20:33:05.9264568Z scale_ub=1200.0, 2025-05-07T20:33:05.9264852Z contiguous=False, 2025-05-07T20:33:05.9265076Z compiled=False, 2025-05-07T20:33:05.9265270Z ) 2025-05-07T20:33:05.9265584Z self = 2025-05-07T20:33:05.9266067Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.9266331Z 2025-05-07T20:33:05.9266410Z @given( 2025-05-07T20:33:05.9266635Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.9266941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.9267242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.9267566Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.9267890Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.9268169Z ) 2025-05-07T20:33:05.9268509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.9268944Z def test_silu_mul_quant( 2025-05-07T20:33:05.9269227Z self, 2025-05-07T20:33:05.9269415Z T: int, 2025-05-07T20:33:05.9269606Z D: int, 2025-05-07T20:33:05.9269904Z scale_ub: Optional[float], 2025-05-07T20:33:05.9270161Z contiguous: bool, 2025-05-07T20:33:05.9270394Z compiled: bool, 2025-05-07T20:33:05.9270608Z ) -> None: 2025-05-07T20:33:05.9270811Z torch.manual_seed(2025) 2025-05-07T20:33:05.9271049Z 2025-05-07T20:33:05.9271313Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.9271694Z 2025-05-07T20:33:05.9271879Z x_sign = torch.sign(x) 2025-05-07T20:33:05.9272168Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.9272469Z x = x_sign * x_clamp 2025-05-07T20:33:05.9278507Z x0 = x[:, :D] 2025-05-07T20:33:05.9278759Z x1 = x[:, D:] 2025-05-07T20:33:05.9278963Z 2025-05-07T20:33:05.9279142Z if contiguous: 2025-05-07T20:33:05.9279374Z x0 = x0.contiguous() 2025-05-07T20:33:05.9279642Z x1 = x1.contiguous() 2025-05-07T20:33:05.9279867Z 2025-05-07T20:33:05.9280056Z if scale_ub is not None: 2025-05-07T20:33:05.9280324Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.9280656Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.9280957Z ) 2025-05-07T20:33:05.9281145Z else: 2025-05-07T20:33:05.9281341Z scale_ub_tensor = None 2025-05-07T20:33:05.9281594Z 2025-05-07T20:33:05.9281824Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.9282131Z op = silu_mul_quant 2025-05-07T20:33:05.9282375Z if compiled: 2025-05-07T20:33:05.9282614Z op = torch.compile(op) 2025-05-07T20:33:05.9282898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.9283164Z 2025-05-07T20:33:05.9283346Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.9283505Z 2025-05-07T20:33:05.9283607Z moe/activation_test.py:117: 2025-05-07T20:33:05.9283898Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.9284228Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.9284498Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.9285183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.9285951Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.9286481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.9287200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.9287852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.9288380Z kernel = self.compile( 2025-05-07T20:33:05.9288915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.9289597Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.9289988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.9290223Z 2025-05-07T20:33:05.9290426Z self = 2025-05-07T20:33:05.9291506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.9292884Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89173940d0>} 2025-05-07T20:33:05.9294266Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.9295290Z context = 2025-05-07T20:33:05.9295572Z 2025-05-07T20:33:05.9295741Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.9296260Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.9296726Z module_map=module_map) 2025-05-07T20:33:05.9297089Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.9297440Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.9297686Z E ^ 2025-05-07T20:33:05.9298151Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.9298597Z 2025-05-07T20:33:05.9299015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.9299527Z 2025-05-07T20:33:05.9299638Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.9300040Z self=, 2025-05-07T20:33:05.9300445Z T=4096, 2025-05-07T20:33:05.9300618Z D=7168, 2025-05-07T20:33:05.9300792Z scale_ub=1200.0, 2025-05-07T20:33:05.9301006Z contiguous=False, 2025-05-07T20:33:05.9301225Z compiled=True, 2025-05-07T20:33:05.9301417Z ) 2025-05-07T20:33:06.0470138Z self = 2025-05-07T20:33:06.0470833Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:06.0471244Z 2025-05-07T20:33:06.0471352Z @given( 2025-05-07T20:33:06.0471672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.0472120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.0472488Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.0472829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.0473151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.0473433Z ) 2025-05-07T20:33:06.0473780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.0474215Z def test_silu_mul_quant( 2025-05-07T20:33:06.0474456Z self, 2025-05-07T20:33:06.0474768Z T: int, 2025-05-07T20:33:06.0474960Z D: int, 2025-05-07T20:33:06.0475178Z scale_ub: Optional[float], 2025-05-07T20:33:06.0475447Z contiguous: bool, 2025-05-07T20:33:06.0475683Z compiled: bool, 2025-05-07T20:33:06.0475906Z ) -> None: 2025-05-07T20:33:06.0476186Z torch.manual_seed(2025) 2025-05-07T20:33:06.0476428Z 2025-05-07T20:33:06.0476697Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.0477037Z 2025-05-07T20:33:06.0477226Z x_sign = torch.sign(x) 2025-05-07T20:33:06.0477516Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.0477889Z x = x_sign * x_clamp 2025-05-07T20:33:06.0478127Z x0 = x[:, :D] 2025-05-07T20:33:06.0478335Z x1 = x[:, D:] 2025-05-07T20:33:06.0478532Z 2025-05-07T20:33:06.0478714Z if contiguous: 2025-05-07T20:33:06.0478939Z x0 = x0.contiguous() 2025-05-07T20:33:06.0479188Z x1 = x1.contiguous() 2025-05-07T20:33:06.0479424Z 2025-05-07T20:33:06.0479601Z if scale_ub is not None: 2025-05-07T20:33:06.0479864Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.0480195Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.0480499Z ) 2025-05-07T20:33:06.0480686Z else: 2025-05-07T20:33:06.0480891Z scale_ub_tensor = None 2025-05-07T20:33:06.0481131Z 2025-05-07T20:33:06.0481362Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.0481674Z op = silu_mul_quant 2025-05-07T20:33:06.0482014Z if compiled: 2025-05-07T20:33:06.0482249Z op = torch.compile(op) 2025-05-07T20:33:06.0482542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.0482809Z 2025-05-07T20:33:06.0482988Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.0483152Z 2025-05-07T20:33:06.0483248Z moe/activation_test.py:117: 2025-05-07T20:33:06.0483537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.0483866Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.0484135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.0484692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:06.0485247Z return fn(*args, **kwargs) 
2025-05-07T20:33:06.0485903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:06.0486590Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:06.0487128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:06.0487805Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:06.0488463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:06.0488991Z     kernel = self.compile(
2025-05-07T20:33:06.0489526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:06.0490176Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:06.0490568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:06.0490798Z
2025-05-07T20:33:06.0490997Z self = <...>
2025-05-07T20:33:06.0492133Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:06.0493524Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f8917394dc0>}
2025-05-07T20:33:06.0494918Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:33:06.0495979Z context = <...>
2025-05-07T20:33:06.0496266Z
2025-05-07T20:33:06.0496430Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:06.0496947Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:06.0497409Z                            module_map=module_map)
2025-05-07T20:33:06.0497813Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.0498161Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:06.0498409Z E       ^
2025-05-07T20:33:06.0498870Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.0499339Z
2025-05-07T20:33:06.0499756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:06.0500266Z
2025-05-07T20:33:06.0500373Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:06.0500784Z     self=<...>,
2025-05-07T20:33:06.0501186Z     T=128,
2025-05-07T20:33:06.0501372Z     D=7168,
2025-05-07T20:33:06.0501576Z     scale_ub=1200.0,
2025-05-07T20:33:06.0501826Z     contiguous=False,
2025-05-07T20:33:06.0502051Z     compiled=True,
2025-05-07T20:33:06.0502248Z )
2025-05-07T20:33:06.0502603Z self = <...>
2025-05-07T20:33:06.0503095Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:33:06.0503361Z
2025-05-07T20:33:06.0503442Z     @given(
2025-05-07T20:33:06.0503664Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:06.0504158Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:06.0504460Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:06.0504779Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:06.0505098Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:06.0505380Z     )
2025-05-07T20:33:06.0505715Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:06.0506153Z     def test_silu_mul_quant(
2025-05-07T20:33:06.0506389Z         self,
2025-05-07T20:33:06.0506576Z         T: int,
2025-05-07T20:33:06.0506756Z         D: int,
2025-05-07T20:33:06.0506973Z         scale_ub: Optional[float],
2025-05-07T20:33:06.0507233Z         contiguous: bool,
2025-05-07T20:33:06.0507467Z         compiled: bool,
2025-05-07T20:33:06.0507680Z     ) -> None:
2025-05-07T20:33:06.0507888Z         torch.manual_seed(2025)
2025-05-07T20:33:06.0508122Z
2025-05-07T20:33:06.0508392Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:06.0508741Z
2025-05-07T20:33:06.0508919Z         x_sign = torch.sign(x)
2025-05-07T20:33:06.0509195Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:06.0509499Z         x = x_sign * x_clamp
2025-05-07T20:33:06.0509727Z         x0 = x[:, :D]
2025-05-07T20:33:06.0509983Z         x1 = x[:, D:]
2025-05-07T20:33:06.0510187Z
2025-05-07T20:33:06.0510359Z         if contiguous:
2025-05-07T20:33:06.0510582Z             x0 = x0.contiguous()
2025-05-07T20:33:06.0510832Z             x1 = x1.contiguous()
2025-05-07T20:33:06.0511061Z
2025-05-07T20:33:06.0511251Z         if scale_ub is not None:
2025-05-07T20:33:06.0511515Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:06.0511840Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:06.0512132Z             )
2025-05-07T20:33:06.0512314Z         else:
2025-05-07T20:33:06.0512517Z             scale_ub_tensor = None
2025-05-07T20:33:06.0512756Z
2025-05-07T20:33:06.0513054Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:06.0513367Z             op = silu_mul_quant
2025-05-07T20:33:06.0513603Z             if compiled:
2025-05-07T20:33:06.0513842Z                 op = torch.compile(op)
2025-05-07T20:33:06.0514199Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:06.0514467Z
2025-05-07T20:33:06.0514647Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:06.0514808Z
2025-05-07T20:33:06.0514906Z moe/activation_test.py:117:
2025-05-07T20:33:06.0515200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:06.0515616Z moe/activation_test.py:115: in fn
2025-05-07T20:33:06.0515886Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:06.0516444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:06.0516993Z     return fn(*args, **kwargs)
2025-05-07T20:33:06.0517643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:06.0518327Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:06.0518859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:06.0519536Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:06.0520192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:06.0520715Z     kernel = self.compile(
2025-05-07T20:33:06.0521312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:06.0521962Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:06.0522351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:06.0522580Z
2025-05-07T20:33:06.0522781Z self = <...>
2025-05-07T20:33:06.0523866Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:06.0525244Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f89171c1940>}
2025-05-07T20:33:06.0526588Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:33:06.0527611Z context = <...>
2025-05-07T20:33:06.0527893Z
2025-05-07T20:33:06.0528061Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:06.0528582Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:06.0529039Z                            module_map=module_map)
2025-05-07T20:33:06.0529399Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.0529752Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:06.0529994Z E       ^
2025-05-07T20:33:06.0530447Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
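Every CompilationError in this run is the same failure: Triton refuses to lower the fp8e4nv element type (PyTorch's float8_e4m3fn) on this runner's GPU. The job ran on linux.g5.4xlarge.nvidia.gpu, whose A10G is compute capability 8.6, and Triton's fp8e4nv codegen needs newer silicon, which is why the error offers only 'fp8e4b15' and 'fp8e5'. A guard along the following lines would skip these examples instead of failing them; this is a minimal sketch, and the helper name plus the (8, 9) threshold are assumptions inferred from the error above, not taken from the FBGEMM sources.

    import unittest

    import torch

    def _gpu_supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) Triton kernels compile only on newer GPUs; the
        # (8, 9) cutoff here is an assumption drawn from the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _gpu_supports_fp8e4nv(), "GPU cannot compile fp8e4nv Triton kernels")
    class SiluMulQuantTest(unittest.TestCase):  # hypothetical class name, for illustration
        ...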
2025-05-07T20:33:06.0530899Z
2025-05-07T20:33:06.0531326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:06.0531839Z
2025-05-07T20:33:06.2255615Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture"), identical traceback to the one above
2025-05-07T20:33:06.2289887Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:06.2290316Z     self=<...>,
2025-05-07T20:33:06.2290724Z     T=16384,
2025-05-07T20:33:06.2290922Z     D=5120,
2025-05-07T20:33:06.2291129Z     scale_ub=None,
2025-05-07T20:33:06.2291354Z     contiguous=False,
2025-05-07T20:33:06.2291594Z     compiled=False,
2025-05-07T20:33:06.2291798Z )
2025-05-07T20:33:06.2292123Z self = <...>
2025-05-07T20:33:06.2292626Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:06.2292906Z
2025-05-07T20:33:06.2292986Z     @given(
2025-05-07T20:33:06.2293224Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:06.2293547Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:06.2293856Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:06.2294195Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:06.2294535Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:06.2294825Z     )
2025-05-07T20:33:06.2295183Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:06.2295633Z     def test_silu_mul_quant(
2025-05-07T20:33:06.2295883Z         self,
2025-05-07T20:33:06.2296078Z         T: int,
2025-05-07T20:33:06.2296287Z         D: int,
2025-05-07T20:33:06.2296515Z         scale_ub: Optional[float],
2025-05-07T20:33:06.2296789Z         contiguous: bool,
2025-05-07T20:33:06.2297038Z         compiled: bool,
2025-05-07T20:33:06.2297274Z     ) -> None:
2025-05-07T20:33:06.2297493Z         torch.manual_seed(2025)
2025-05-07T20:33:06.2297747Z
2025-05-07T20:33:06.2298030Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:06.2298450Z
2025-05-07T20:33:06.2298654Z         x_sign = torch.sign(x)
2025-05-07T20:33:06.2298955Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:06.2301032Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:06.2302937Z
2025-05-07T20:33:06.2303072Z moe/activation_test.py:95: OutOfMemoryError
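The OutOfMemoryError cascade that follows is consistent with plain tensor arithmetic: each example materializes x with shape [T, 2 * D] in bfloat16 (2 bytes per element) while the process is already holding roughly 22 GiB of the A10G's 22.07 GiB, so even the first allocation of an example can fail. The "Tried to allocate" figures match the input sizes exactly; a quick sanity check of that arithmetic (a standalone sketch, not part of the test):

    # MiB needed for one [T, 2 * D] bfloat16 tensor (2 bytes per element).
    def input_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20

    print(input_mib(16384, 5120))  # 320.0 -> "Tried to allocate 320.00 MiB" above
    print(input_mib(4096, 7168))   # 112.0
    print(input_mib(2048, 7168))   # 56.0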
2025-05-07T20:33:06.2303400Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 112.00 MiB)
2025-05-07T20:33:06.2317313Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 448.00 MiB)
2025-05-07T20:33:06.3395402Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 56.00 MiB)
2025-05-07T20:33:06.3409300Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (tried to allocate 56.00 MiB)
2025-05-07T20:33:06.3422324Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture"), identical traceback to the one above
2025-05-07T20:33:06.6744698Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture"), identical traceback
2025-05-07T20:33:06.6778880Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture"), identical traceback
2025-05-07T20:33:06.7708642Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 56.00 MiB)
2025-05-07T20:33:06.7721127Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture"), identical traceback
2025-05-07T20:33:06.8240033Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (tried to allocate 40.00 MiB)
2025-05-07T20:33:06.8253082Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 320.00 MiB)
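All of these allocator messages end with the same hint about PYTORCH_CUDA_ALLOC_CONF. In this run the reserved-but-unallocated figures are small (tens of MiB), so fragmentation is probably not the main culprit, but if the hint is followed it has to take effect before the process performs its first CUDA allocation. A sketch of the safe ordering, assuming nothing else initializes CUDA first (in CI it is simpler to set the variable in the job environment):

    import os

    # Must be in place before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported only after the allocator config is set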
2025-05-07T20:33:06.8265583Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 80.00 MiB)
2025-05-07T20:33:06.9315819Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 40.00 MiB)
2025-05-07T20:33:06.9328340Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 112.00 MiB)
2025-05-07T20:33:06.9340555Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 40.00 MiB)
2025-05-07T20:33:06.9352958Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 112.00 MiB)
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:06.9364900Z 2025-05-07T20:33:06.9365016Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:06.9365229Z 2025-05-07T20:33:06.9365342Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.9365745Z self=, 2025-05-07T20:33:06.9366143Z T=16384, 2025-05-07T20:33:06.9366332Z D=7168, 2025-05-07T20:33:06.9366515Z scale_ub=None, 2025-05-07T20:33:06.9366728Z contiguous=False, 2025-05-07T20:33:06.9366993Z compiled=True, 2025-05-07T20:33:06.9367194Z ) 2025-05-07T20:33:07.0669582Z self = 2025-05-07T20:33:07.0670373Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.0670751Z 2025-05-07T20:33:07.0670862Z @given( 2025-05-07T20:33:07.0671186Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.0671497Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.0671801Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.0672134Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.0672497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.0672778Z ) 2025-05-07T20:33:07.0673130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.0673570Z def test_silu_mul_quant( 2025-05-07T20:33:07.0673809Z self, 2025-05-07T20:33:07.0674019Z T: int, 2025-05-07T20:33:07.0674223Z D: int, 2025-05-07T20:33:07.0674442Z scale_ub: Optional[float], 2025-05-07T20:33:07.0674711Z contiguous: bool, 2025-05-07T20:33:07.0674955Z compiled: bool, 2025-05-07T20:33:07.0675186Z ) -> None: 2025-05-07T20:33:07.0675396Z torch.manual_seed(2025) 2025-05-07T20:33:07.0675640Z 2025-05-07T20:33:07.0675916Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.0677992Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
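The request sizes the allocator reports match the test's input shape exactly: x is [T, 2*D] in bfloat16 (2 bytes per element), so T=16384, D=7168 gives 16384 x 14336 x 2 bytes = 448 MiB, T=4096 gives 112 MiB, and T=2048, D=5120 gives the 40 MiB seen above. A quick check:

    # Allocation-size check for x = torch.randn([T, 2 * D], dtype=torch.bfloat16)
    def alloc_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
        return T * 2 * D * bytes_per_elem / 2**20

    assert alloc_mib(16384, 7168) == 448.0
    assert alloc_mib(4096, 7168) == 112.0
    assert alloc_mib(2048, 5120) == 40.0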
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.0680185Z 2025-05-07T20:33:07.0680304Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.0680522Z 2025-05-07T20:33:07.0680622Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.0681036Z self=, 2025-05-07T20:33:07.0681467Z T=4096, 2025-05-07T20:33:07.0681674Z D=7168, 2025-05-07T20:33:07.0681866Z scale_ub=None, 2025-05-07T20:33:07.0682084Z contiguous=True, 2025-05-07T20:33:07.0682301Z compiled=False, 2025-05-07T20:33:07.0682511Z ) 2025-05-07T20:33:07.0682907Z self = 2025-05-07T20:33:07.0683397Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.0683670Z 2025-05-07T20:33:07.0683747Z @given( 2025-05-07T20:33:07.0683980Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.0684363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.0684668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.0684997Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.0685321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.0685599Z ) 2025-05-07T20:33:07.0685949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.0686386Z def test_silu_mul_quant( 2025-05-07T20:33:07.0686622Z self, 2025-05-07T20:33:07.0686817Z T: int, 2025-05-07T20:33:07.0687018Z D: int, 2025-05-07T20:33:07.0687230Z scale_ub: Optional[float], 2025-05-07T20:33:07.0687508Z contiguous: bool, 2025-05-07T20:33:07.0687746Z compiled: bool, 2025-05-07T20:33:07.0687964Z ) -> None: 2025-05-07T20:33:07.0688177Z torch.manual_seed(2025) 2025-05-07T20:33:07.0688418Z 2025-05-07T20:33:07.0688682Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.0690820Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.0692727Z 2025-05-07T20:33:07.0692847Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.0693066Z 2025-05-07T20:33:07.0693168Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.0693579Z self=, 2025-05-07T20:33:07.0693976Z T=16384, 2025-05-07T20:33:07.0694172Z D=7168, 2025-05-07T20:33:07.0694369Z scale_ub=None, 2025-05-07T20:33:07.0694578Z contiguous=True, 2025-05-07T20:33:07.0703212Z compiled=False, 2025-05-07T20:33:07.0703431Z ) 2025-05-07T20:33:07.0704018Z self = 2025-05-07T20:33:07.0704535Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.0704814Z 2025-05-07T20:33:07.0704891Z @given( 2025-05-07T20:33:07.0705124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.0705444Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.0705748Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.0706217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.0706553Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.0706848Z ) 2025-05-07T20:33:07.0707197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.0707644Z def test_silu_mul_quant( 2025-05-07T20:33:07.0707901Z self, 2025-05-07T20:33:07.0708094Z T: int, 2025-05-07T20:33:07.0708293Z D: int, 2025-05-07T20:33:07.0708516Z scale_ub: Optional[float], 2025-05-07T20:33:07.0708785Z contiguous: bool, 2025-05-07T20:33:07.0709030Z compiled: bool, 2025-05-07T20:33:07.0709261Z ) -> None: 2025-05-07T20:33:07.0709478Z torch.manual_seed(2025) 2025-05-07T20:33:07.0709730Z 2025-05-07T20:33:07.0710072Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.0712260Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.0714247Z 2025-05-07T20:33:07.0714374Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.0714585Z 2025-05-07T20:33:07.0714684Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.0715102Z self=, 2025-05-07T20:33:07.0715510Z T=16384, 2025-05-07T20:33:07.0715694Z D=7168, 2025-05-07T20:33:07.0715883Z scale_ub=1200.0, 2025-05-07T20:33:07.0716105Z contiguous=True, 2025-05-07T20:33:07.0716318Z compiled=False, 2025-05-07T20:33:07.0716530Z ) 2025-05-07T20:33:07.0716847Z self = 2025-05-07T20:33:07.0717337Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.0717625Z 2025-05-07T20:33:07.0717697Z @given( 2025-05-07T20:33:07.0717990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.0718303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.0718604Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.0718934Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.0719261Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.0719538Z ) 2025-05-07T20:33:07.0719887Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.0720327Z def test_silu_mul_quant( 2025-05-07T20:33:07.0720563Z self, 2025-05-07T20:33:07.0720753Z T: int, 2025-05-07T20:33:07.0720951Z D: int, 2025-05-07T20:33:07.0721162Z scale_ub: Optional[float], 2025-05-07T20:33:07.0721438Z contiguous: bool, 2025-05-07T20:33:07.0721676Z compiled: bool, 2025-05-07T20:33:07.0721900Z ) -> None: 2025-05-07T20:33:07.0722106Z torch.manual_seed(2025) 2025-05-07T20:33:07.0722350Z 2025-05-07T20:33:07.0722627Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.0724713Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.0726656Z 2025-05-07T20:33:07.0726773Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.0726991Z 2025-05-07T20:33:07.0727093Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.0727507Z self=, 2025-05-07T20:33:07.0727910Z T=128, 2025-05-07T20:33:07.0728089Z D=5120, 2025-05-07T20:33:07.0728275Z scale_ub=1200.0, 2025-05-07T20:33:07.0728503Z contiguous=False, 2025-05-07T20:33:07.0728718Z compiled=False, 2025-05-07T20:33:07.0728924Z ) 2025-05-07T20:33:07.2350125Z self = 2025-05-07T20:33:07.2350893Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.2351269Z 2025-05-07T20:33:07.2351379Z @given( 2025-05-07T20:33:07.2351670Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2351990Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2352605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2352943Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2353278Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2353573Z ) 2025-05-07T20:33:07.2353927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2354460Z def test_silu_mul_quant( 2025-05-07T20:33:07.2354708Z self, 2025-05-07T20:33:07.2354903Z T: int, 2025-05-07T20:33:07.2355105Z D: int, 2025-05-07T20:33:07.2355328Z scale_ub: Optional[float], 2025-05-07T20:33:07.2355602Z contiguous: bool, 2025-05-07T20:33:07.2355838Z compiled: bool, 2025-05-07T20:33:07.2356069Z ) -> None: 2025-05-07T20:33:07.2356295Z torch.manual_seed(2025) 2025-05-07T20:33:07.2356533Z 2025-05-07T20:33:07.2356810Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2357158Z 2025-05-07T20:33:07.2357353Z x_sign = torch.sign(x) 2025-05-07T20:33:07.2357650Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.2357963Z x = x_sign * x_clamp 2025-05-07T20:33:07.2358197Z x0 = x[:, :D] 2025-05-07T20:33:07.2358415Z x1 = x[:, D:] 2025-05-07T20:33:07.2358626Z 2025-05-07T20:33:07.2358887Z if contiguous: 2025-05-07T20:33:07.2359128Z x0 = x0.contiguous() 2025-05-07T20:33:07.2359392Z x1 = x1.contiguous() 2025-05-07T20:33:07.2359630Z 2025-05-07T20:33:07.2359825Z if scale_ub is not None: 2025-05-07T20:33:07.2360101Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.2360437Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.2360750Z ) 2025-05-07T20:33:07.2360950Z else: 2025-05-07T20:33:07.2361163Z scale_ub_tensor = None 2025-05-07T20:33:07.2361426Z 2025-05-07T20:33:07.2361701Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.2362027Z op = silu_mul_quant 2025-05-07T20:33:07.2362277Z if compiled: 2025-05-07T20:33:07.2362528Z op = torch.compile(op) 2025-05-07T20:33:07.2362833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.2363107Z 2025-05-07T20:33:07.2363305Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.2363475Z 2025-05-07T20:33:07.2363588Z moe/activation_test.py:117: 2025-05-07T20:33:07.2363884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.2364222Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.2364510Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.2365214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.2365908Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.2366457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.2367241Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.2367903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.2368447Z kernel = self.compile( 2025-05-07T20:33:07.2369006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.2369669Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.2370068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.2370302Z 2025-05-07T20:33:07.2370509Z self = 2025-05-07T20:33:07.2371662Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.2373133Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8916e4cca0>} 2025-05-07T20:33:07.2374480Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.2375550Z context = 2025-05-07T20:33:07.2375844Z 2025-05-07T20:33:07.2376012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.2376539Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.2377004Z module_map=module_map) 2025-05-07T20:33:07.2377382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.2377740Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.2378005Z E ^ 2025-05-07T20:33:07.2378468Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.2378929Z 2025-05-07T20:33:07.2379388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.2379908Z 2025-05-07T20:33:07.2380020Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2380440Z self=, 2025-05-07T20:33:07.2380844Z T=2048, 2025-05-07T20:33:07.2381034Z D=7168, 2025-05-07T20:33:07.2381227Z scale_ub=None, 2025-05-07T20:33:07.2381440Z contiguous=False, 2025-05-07T20:33:07.2381669Z compiled=False, 2025-05-07T20:33:07.2381879Z ) 2025-05-07T20:33:07.2382195Z self = 2025-05-07T20:33:07.2382705Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.2382978Z 2025-05-07T20:33:07.2383061Z @given( 2025-05-07T20:33:07.2383289Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2383605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2383922Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2384255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2384586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2384872Z ) 2025-05-07T20:33:07.2385224Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2385658Z def test_silu_mul_quant( 2025-05-07T20:33:07.2385902Z self, 2025-05-07T20:33:07.2386103Z T: int, 2025-05-07T20:33:07.2386297Z D: int, 2025-05-07T20:33:07.2386519Z scale_ub: Optional[float], 2025-05-07T20:33:07.2386802Z contiguous: bool, 2025-05-07T20:33:07.2387097Z compiled: bool, 2025-05-07T20:33:07.2387323Z ) -> None: 2025-05-07T20:33:07.2387544Z torch.manual_seed(2025) 2025-05-07T20:33:07.2387783Z 2025-05-07T20:33:07.2388066Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2390212Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
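The CompilationError earlier in this trace is a distinct failure mode from the OOMs: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype on this GPU, offering only fp8e4b15 and fp8e5. The 22.07 GiB capacity reported in these errors is consistent with an NVIDIA A10G (compute capability 8.6), below the newer architectures on which Triton exposes fp8e4nv. A hedged sketch of a capability guard such a test could use to skip rather than error; the SM 8.9 threshold is an assumption, not taken from this log:

    # Sketch: skip FP8 E4M3 tests on GPUs Triton cannot compile them for.
    # Assumes SM 8.9 (Ada) as the minimum; adjust if the kernel's needs differ.
    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class Fp8GuardedTests(unittest.TestCase):
        ...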
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2392088Z 2025-05-07T20:33:07.2392211Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2392426Z 2025-05-07T20:33:07.2392584Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2392997Z self=, 2025-05-07T20:33:07.2393404Z T=128, 2025-05-07T20:33:07.2393593Z D=7168, 2025-05-07T20:33:07.2393780Z scale_ub=1200.0, 2025-05-07T20:33:07.2394054Z contiguous=True, 2025-05-07T20:33:07.2394279Z compiled=True, 2025-05-07T20:33:07.2394485Z ) 2025-05-07T20:33:07.2843577Z self = 2025-05-07T20:33:07.2844169Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.2844448Z 2025-05-07T20:33:07.2844527Z @given( 2025-05-07T20:33:07.2844761Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2845073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2845372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2845725Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2846066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2846349Z ) 2025-05-07T20:33:07.2846698Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2847157Z def test_silu_mul_quant( 2025-05-07T20:33:07.2847401Z self, 2025-05-07T20:33:07.2847767Z T: int, 2025-05-07T20:33:07.2847975Z D: int, 2025-05-07T20:33:07.2848202Z scale_ub: Optional[float], 2025-05-07T20:33:07.2848469Z contiguous: bool, 2025-05-07T20:33:07.2848712Z compiled: bool, 2025-05-07T20:33:07.2848941Z ) -> None: 2025-05-07T20:33:07.2849158Z torch.manual_seed(2025) 2025-05-07T20:33:07.2849396Z 2025-05-07T20:33:07.2849667Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2850009Z 2025-05-07T20:33:07.2850198Z x_sign = torch.sign(x) 2025-05-07T20:33:07.2850488Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.2850805Z x = x_sign * x_clamp 2025-05-07T20:33:07.2851040Z x0 = x[:, :D] 2025-05-07T20:33:07.2851262Z x1 = x[:, D:] 2025-05-07T20:33:07.2851496Z 2025-05-07T20:33:07.2851701Z if contiguous: 2025-05-07T20:33:07.2851934Z x0 = x0.contiguous() 2025-05-07T20:33:07.2852207Z x1 = x1.contiguous() 2025-05-07T20:33:07.2852442Z 2025-05-07T20:33:07.2852633Z if scale_ub is not None: 2025-05-07T20:33:07.2852907Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.2853240Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.2853550Z ) 2025-05-07T20:33:07.2853742Z else: 2025-05-07T20:33:07.2853947Z scale_ub_tensor = None 2025-05-07T20:33:07.2854200Z 2025-05-07T20:33:07.2854432Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.2854747Z op = silu_mul_quant 2025-05-07T20:33:07.2854996Z if compiled: 2025-05-07T20:33:07.2855336Z op = torch.compile(op) 2025-05-07T20:33:07.2855635Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.2855904Z 2025-05-07T20:33:07.2856095Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.2856260Z 2025-05-07T20:33:07.2856364Z moe/activation_test.py:117: 2025-05-07T20:33:07.2856669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.2857007Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.2857290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.2857847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.2858402Z return fn(*args, **kwargs) 2025-05-07T20:33:07.2859058Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.2859749Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.2860361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.2861046Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.2861708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.2862317Z kernel = self.compile( 2025-05-07T20:33:07.2862852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.2863506Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.2863905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.2864133Z 2025-05-07T20:33:07.2864336Z self = 2025-05-07T20:33:07.2865424Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.2866816Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8916d390d0>} 2025-05-07T20:33:07.2868250Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.2869273Z context = 2025-05-07T20:33:07.2869557Z 2025-05-07T20:33:07.2869725Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.2870356Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.2870820Z module_map=module_map) 2025-05-07T20:33:07.2871197Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.2871547Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.2871810Z E ^ 2025-05-07T20:33:07.2872284Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.2872739Z 2025-05-07T20:33:07.2873151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.2873666Z 2025-05-07T20:33:07.2873769Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2874183Z self=, 2025-05-07T20:33:07.2874581Z T=128, 2025-05-07T20:33:07.2874762Z D=7168, 2025-05-07T20:33:07.2874955Z scale_ub=1200.0, 2025-05-07T20:33:07.2875178Z contiguous=True, 2025-05-07T20:33:07.2875395Z compiled=False, 2025-05-07T20:33:07.2875607Z ) 2025-05-07T20:33:07.2875925Z self = 2025-05-07T20:33:07.2876469Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.2876749Z 2025-05-07T20:33:07.2876826Z @given( 2025-05-07T20:33:07.2877063Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2877373Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2877680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2878245Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2878583Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2878864Z ) 2025-05-07T20:33:07.2879215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2879653Z def test_silu_mul_quant( 2025-05-07T20:33:07.2879891Z self, 2025-05-07T20:33:07.2880090Z T: int, 2025-05-07T20:33:07.2880290Z D: int, 2025-05-07T20:33:07.2880501Z scale_ub: Optional[float], 2025-05-07T20:33:07.2880827Z contiguous: bool, 2025-05-07T20:33:07.2881067Z compiled: bool, 2025-05-07T20:33:07.2881288Z ) -> None: 2025-05-07T20:33:07.2881509Z torch.manual_seed(2025) 2025-05-07T20:33:07.2881752Z 2025-05-07T20:33:07.2882025Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2882462Z 2025-05-07T20:33:07.2882662Z x_sign = torch.sign(x) 2025-05-07T20:33:07.2882955Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.2885105Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
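Note how the OOM site moves here from the randn at moe/activation_test.py:92 to the clamp at line 95: the input tensor itself fit, but the out-of-place sign/abs/clamp steps each materialize another [T, 2*D] temporary. A sketch of an in-place rewrite that computes the same values with fewer temporaries; this is a possible memory-reduction tweak, not the test's actual code:

    # Sketch: same math as sign(x) * clamp(abs(x), 0.01, 2.0), fewer temporaries.
    import torch

    def clamp_magnitude_(x: torch.Tensor, lo: float = 0.01, hi: float = 2.0) -> torch.Tensor:
        x_sign = torch.sign(x)     # one temporary still needed for the sign
        x.abs_().clamp_(lo, hi)    # in-place abs and clamp, no new allocation
        return x.mul_(x_sign)      # in-place multiply restores the sign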
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2886969Z 2025-05-07T20:33:07.2887091Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.2887309Z 2025-05-07T20:33:07.2887411Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2887878Z self=, 2025-05-07T20:33:07.2888287Z T=128, 2025-05-07T20:33:07.2888476Z D=5120, 2025-05-07T20:33:07.2888671Z scale_ub=1200.0, 2025-05-07T20:33:07.2888897Z contiguous=True, 2025-05-07T20:33:07.2889115Z compiled=True, 2025-05-07T20:33:07.2889319Z ) 2025-05-07T20:33:07.2889637Z self = 2025-05-07T20:33:07.2890118Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.2890388Z 2025-05-07T20:33:07.2890464Z @given( 2025-05-07T20:33:07.2890699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2891005Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2891315Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2891694Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2892017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2892308Z ) 2025-05-07T20:33:07.2892656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2893092Z def test_silu_mul_quant( 2025-05-07T20:33:07.2893326Z self, 2025-05-07T20:33:07.2893520Z T: int, 2025-05-07T20:33:07.2893716Z D: int, 2025-05-07T20:33:07.2893926Z scale_ub: Optional[float], 2025-05-07T20:33:07.2894196Z contiguous: bool, 2025-05-07T20:33:07.2894436Z compiled: bool, 2025-05-07T20:33:07.2894651Z ) -> None: 2025-05-07T20:33:07.2894866Z torch.manual_seed(2025) 2025-05-07T20:33:07.2895110Z 2025-05-07T20:33:07.2895375Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2895780Z 2025-05-07T20:33:07.2896020Z x_sign = torch.sign(x) 2025-05-07T20:33:07.2896408Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.2898579Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2900460Z 2025-05-07T20:33:07.2900578Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.2900796Z 2025-05-07T20:33:07.2900960Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2901380Z self=, 2025-05-07T20:33:07.2901827Z T=128, 2025-05-07T20:33:07.2902017Z D=7168, 2025-05-07T20:33:07.2902209Z scale_ub=None, 2025-05-07T20:33:07.2902418Z contiguous=True, 2025-05-07T20:33:07.2902693Z compiled=True, 2025-05-07T20:33:07.2902893Z ) 2025-05-07T20:33:07.5018193Z self = 2025-05-07T20:33:07.5018832Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.5019101Z 2025-05-07T20:33:07.5019192Z @given( 2025-05-07T20:33:07.5019425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5019750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5020069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5020410Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5020768Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5021068Z ) 2025-05-07T20:33:07.5021429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5021872Z def test_silu_mul_quant( 2025-05-07T20:33:07.5022121Z self, 2025-05-07T20:33:07.5022326Z T: int, 2025-05-07T20:33:07.5022750Z D: int, 2025-05-07T20:33:07.5022979Z scale_ub: Optional[float], 2025-05-07T20:33:07.5023258Z contiguous: bool, 2025-05-07T20:33:07.5023496Z compiled: bool, 2025-05-07T20:33:07.5023730Z ) -> None: 2025-05-07T20:33:07.5023952Z torch.manual_seed(2025) 2025-05-07T20:33:07.5024196Z 2025-05-07T20:33:07.5024473Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5026573Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
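Once the device is this full (4.44 MiB free), every later example fails regardless of its own size, so the remaining failures are cascading rather than independent. One workaround sketch, assuming the leak is dangling example tensors plus cached allocator blocks rather than a driver-level issue, is to reset CUDA memory between examples:

    # Sketch: free dangling tensors and cached allocator blocks between examples.
    import gc
    import torch

    def reset_cuda_memory() -> None:
        gc.collect()               # drop Python-side references first
        torch.cuda.empty_cache()   # release unoccupied cached blocks
        torch.cuda.synchronize()   # ensure pending work has completed

Calling reset_cuda_memory() at the top of the test body (or in setUp) keeps one example's allocations from starving the next.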
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5028454Z 2025-05-07T20:33:07.5028577Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.5028795Z 2025-05-07T20:33:07.5040486Z FAILED 2025-05-07T20:33:07.5040683Z 2025-05-07T20:33:07.5040871Z =================================== FAILURES =================================== 2025-05-07T20:33:07.5041487Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:07.5042111Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:07.5055483Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:33:07.5056545Z | yield 2025-05-07T20:33:07.5057150Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:33:07.5057888Z | self._callTestMethod(testMethod) 2025-05-07T20:33:07.5058696Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:33:07.5059457Z | method() 2025-05-07T20:33:07.5060345Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:07.5061368Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5062273Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:07.5063126Z | raise the_error_hypothesis_found 2025-05-07T20:33:07.5063931Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:07.5064615Z +-+---------------- 1 ---------------- 2025-05-07T20:33:07.5065016Z | Traceback (most recent call last): 2025-05-07T20:33:07.5065993Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:07.5067237Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5070276Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5073069Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.5073674Z | self=, 2025-05-07T20:33:07.5074256Z | T=2048, 2025-05-07T20:33:07.5074577Z | D=5120, # or any other generated value 2025-05-07T20:33:07.5075139Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:07.5075644Z | contiguous=True, # or any other generated value 2025-05-07T20:33:07.5076131Z | compiled=False, # or any other generated value 2025-05-07T20:33:07.5076561Z | ) 2025-05-07T20:33:07.5076806Z | 2025-05-07T20:33:07.5077519Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:07.5078364Z +---------------- 2 ---------------- 2025-05-07T20:33:07.5078774Z | Traceback (most recent call last): 2025-05-07T20:33:07.5079761Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:07.5080826Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5083676Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5086432Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.5087043Z | self=, 2025-05-07T20:33:07.5087663Z | T=128, 2025-05-07T20:33:07.5088708Z | D=7168, 2025-05-07T20:33:07.5089006Z | scale_ub=None, 2025-05-07T20:33:07.5089267Z | contiguous=True, 2025-05-07T20:33:07.5089505Z | compiled=True, 2025-05-07T20:33:07.5089730Z | ) 2025-05-07T20:33:07.5089920Z | 2025-05-07T20:33:07.5090468Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:07.5091076Z +---------------- 3 ---------------- 2025-05-07T20:33:07.5091370Z | Traceback (most recent call last): 2025-05-07T20:33:07.5092142Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:07.5092934Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5095362Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
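The base64 blobs in these notes are Hypothesis's serialized choice sequences; pinning one replays exactly that falsifying example. A sketch of how the suggested decorator is applied, with the version string and blob copied verbatim from failure 1 above; the decorator sits on top of the test's existing @given/@settings stack and is removed once the bug is fixed:

    # Sketch: replay one specific falsifying example from this run.
    from hypothesis import reproduce_failure

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    # @given(...) and @settings(...) exactly as in activation_test.py
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...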
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5098318Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.5098957Z | self=, 2025-05-07T20:33:07.5099559Z | T=128, 2025-05-07T20:33:07.5099842Z | D=5120, 2025-05-07T20:33:07.5100144Z | scale_ub=1200.0, 2025-05-07T20:33:07.5100397Z | contiguous=True, 2025-05-07T20:33:07.5100634Z | compiled=True, 2025-05-07T20:33:07.5100920Z | ) 2025-05-07T20:33:07.5101176Z | 2025-05-07T20:33:07.5101900Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:07.5102723Z +---------------- 4 ---------------- 2025-05-07T20:33:07.5103116Z | Traceback (most recent call last): 2025-05-07T20:33:07.5104592Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:07.5105596Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.5106510Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:07.5107481Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5108660Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:07.5109769Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5110712Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:07.5111735Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5112753Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:07.5113826Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5114945Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:33:07.5116078Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5117319Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:07.5118283Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5119199Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:07.5119988Z | fn() 2025-05-07T20:33:07.5120768Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:07.5121655Z | self.fn.run( 2025-05-07T20:33:07.5122391Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:07.5123184Z | kernel = self.compile( 2025-05-07T20:33:07.5124110Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:07.5125087Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5126059Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:07.5127196Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5127993Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5128465Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5128816Z | ^ 2025-05-07T20:33:07.5129447Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5130217Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.5130774Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:07.5131495Z | self=, 2025-05-07T20:33:07.5132153Z | T=1, # or any other generated value 2025-05-07T20:33:07.5132580Z | D=5120, # or any other generated value 2025-05-07T20:33:07.5133046Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:07.5133529Z | contiguous=True, # or any other generated value 2025-05-07T20:33:07.5134089Z | compiled=True, # or any other generated value 2025-05-07T20:33:07.5134503Z | ) 2025-05-07T20:33:07.5134737Z | 2025-05-07T20:33:07.5135460Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:07.5136306Z +------------------------------------ 2025-05-07T20:33:07.5136808Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:07.5137328Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5137917Z self=, 2025-05-07T20:33:07.5138493Z T=1, 2025-05-07T20:33:07.5138737Z D=5120, 2025-05-07T20:33:07.5139004Z scale_ub=None, 2025-05-07T20:33:07.5139294Z contiguous=True, 2025-05-07T20:33:07.5139609Z compiled=True, 2025-05-07T20:33:07.5139901Z ) 2025-05-07T20:33:07.5140346Z self = 2025-05-07T20:33:07.5140998Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.5141364Z 2025-05-07T20:33:07.5141469Z @given( 2025-05-07T20:33:07.5141780Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5142230Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5142651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5143110Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5143564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5143950Z ) 2025-05-07T20:33:07.5144419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5145084Z def test_silu_mul_quant( 2025-05-07T20:33:07.5145404Z self, 2025-05-07T20:33:07.5145653Z T: int, 2025-05-07T20:33:07.5145911Z D: int, 2025-05-07T20:33:07.5146194Z scale_ub: Optional[float], 2025-05-07T20:33:07.5146545Z contiguous: bool, 2025-05-07T20:33:07.5146852Z compiled: bool, 2025-05-07T20:33:07.5147140Z ) -> None: 2025-05-07T20:33:07.5147413Z torch.manual_seed(2025) 2025-05-07T20:33:07.5147724Z 2025-05-07T20:33:07.5148076Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5148515Z 2025-05-07T20:33:07.5148761Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5149132Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5149536Z x = x_sign * x_clamp 2025-05-07T20:33:07.5149949Z x0 = x[:, :D] 2025-05-07T20:33:07.5150229Z x1 = x[:, D:] 2025-05-07T20:33:07.5150556Z 2025-05-07T20:33:07.5150803Z if contiguous: 2025-05-07T20:33:07.5151124Z x0 = x0.contiguous() 
2025-05-07T20:33:07.5151488Z x1 = x1.contiguous() 2025-05-07T20:33:07.5151833Z 2025-05-07T20:33:07.5152083Z if scale_ub is not None: 2025-05-07T20:33:07.5152448Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5153332Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5153733Z ) 2025-05-07T20:33:07.5153976Z else: 2025-05-07T20:33:07.5154237Z scale_ub_tensor = None 2025-05-07T20:33:07.5154566Z 2025-05-07T20:33:07.5154861Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5155260Z op = silu_mul_quant 2025-05-07T20:33:07.5155583Z if compiled: 2025-05-07T20:33:07.5155915Z op = torch.compile(op) 2025-05-07T20:33:07.5156312Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5156704Z 2025-05-07T20:33:07.5156971Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.5157363Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.5157765Z 2025-05-07T20:33:07.5158053Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5158484Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.5158896Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.5159336Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.5159836Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5160251Z 2025-05-07T20:33:07.5160513Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.5160768Z 2025-05-07T20:33:07.5160903Z moe/activation_test.py:126: 2025-05-07T20:33:07.5161289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5161743Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.5162196Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5163229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.5164213Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5164926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5165817Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5166724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.5167669Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5168662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.5169650Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5170694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.5171528Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5172326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.5173018Z fn() 2025-05-07T20:33:07.5173692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.5174490Z self.fn.run( 2025-05-07T20:33:07.5175124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5175834Z kernel = self.compile( 2025-05-07T20:33:07.5176573Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5177547Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5178118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5178437Z 2025-05-07T20:33:07.5178721Z self = 2025-05-07T20:33:07.5180271Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5182286Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891b3dd9d0>} 2025-05-07T20:33:07.5184175Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5185540Z context = 2025-05-07T20:33:07.5185919Z 2025-05-07T20:33:07.5186132Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5186812Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5187477Z module_map=module_map) 2025-05-07T20:33:07.5187948Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5188397Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5188742Z E ^ 2025-05-07T20:33:07.5189344Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5190046Z 2025-05-07T20:33:07.5190614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5191293Z 2025-05-07T20:33:07.5191429Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5192002Z self=, 2025-05-07T20:33:07.5192538Z T=2048, 2025-05-07T20:33:07.5192770Z D=5120, 2025-05-07T20:33:07.5193017Z scale_ub=1200.0, 2025-05-07T20:33:07.5193303Z contiguous=True, 2025-05-07T20:33:07.5193582Z compiled=False, 2025-05-07T20:33:07.5193856Z ) 2025-05-07T20:33:07.5194276Z self = 2025-05-07T20:33:07.5194926Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.5195300Z 2025-05-07T20:33:07.5195399Z @given( 2025-05-07T20:33:07.5195702Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5196110Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5196515Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5196962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5197416Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5197876Z ) 2025-05-07T20:33:07.5198363Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5198966Z def test_silu_mul_quant( 2025-05-07T20:33:07.5199282Z self, 2025-05-07T20:33:07.5199540Z T: int, 2025-05-07T20:33:07.5199806Z D: int, 2025-05-07T20:33:07.5200088Z scale_ub: Optional[float], 2025-05-07T20:33:07.5200457Z contiguous: bool, 2025-05-07T20:33:07.5200787Z compiled: bool, 2025-05-07T20:33:07.5201081Z ) -> None: 2025-05-07T20:33:07.5201356Z torch.manual_seed(2025) 2025-05-07T20:33:07.5201672Z 2025-05-07T20:33:07.5202024Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5202459Z 2025-05-07T20:33:07.5202705Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5203079Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5203469Z x = x_sign * x_clamp 2025-05-07T20:33:07.5204154Z x0 = x[:, :D] 
2025-05-07T20:33:07.5204450Z x1 = x[:, D:] 2025-05-07T20:33:07.5204712Z 2025-05-07T20:33:07.5204948Z if contiguous: 2025-05-07T20:33:07.5205243Z x0 = x0.contiguous() 2025-05-07T20:33:07.5205573Z x1 = x1.contiguous() 2025-05-07T20:33:07.5205962Z 2025-05-07T20:33:07.5206216Z if scale_ub is not None: 2025-05-07T20:33:07.5206563Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5207015Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5207414Z ) 2025-05-07T20:33:07.5207661Z else: 2025-05-07T20:33:07.5207941Z scale_ub_tensor = None 2025-05-07T20:33:07.5208278Z 2025-05-07T20:33:07.5208600Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5209029Z op = silu_mul_quant 2025-05-07T20:33:07.5209379Z if compiled: 2025-05-07T20:33:07.5209709Z op = torch.compile(op) 2025-05-07T20:33:07.5210109Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5210473Z 2025-05-07T20:33:07.5210728Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5210946Z 2025-05-07T20:33:07.5211078Z moe/activation_test.py:117: 2025-05-07T20:33:07.5211562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5212017Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5212380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5213301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5214226Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5214945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5215878Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5216769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5217504Z kernel = self.compile( 2025-05-07T20:33:07.5218250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5219147Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5219689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5220006Z 2025-05-07T20:33:07.5220289Z self = 2025-05-07T20:33:07.5221837Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5223791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f88f9ced5e0>} 2025-05-07T20:33:07.5225766Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5227204Z context = 2025-05-07T20:33:07.5227604Z 2025-05-07T20:33:07.5227841Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5228564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5229204Z module_map=module_map) 2025-05-07T20:33:07.5229694Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5230282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5230626Z E ^ 2025-05-07T20:33:07.5231319Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:07.5233383Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:07.5234009Z     self=,
2025-05-07T20:33:07.5234558Z     T=2048,
2025-05-07T20:33:07.5234811Z     D=5120,
2025-05-07T20:33:07.5235063Z     scale_ub=1200.0,
2025-05-07T20:33:07.5235376Z     contiguous=True,
2025-05-07T20:33:07.5235698Z     compiled=True,
2025-05-07T20:33:07.5235968Z )
2025-05-07T20:33:07.5236404Z self =
2025-05-07T20:33:07.5237087Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:07.5237469Z
2025-05-07T20:33:07.5237574Z     @given(
2025-05-07T20:33:07.5237892Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:07.5238319Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:07.5238730Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:07.5239184Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:07.5239692Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:07.5240084Z     )
2025-05-07T20:33:07.5240544Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:07.5241143Z     def test_silu_mul_quant(
2025-05-07T20:33:07.5241457Z         self,
2025-05-07T20:33:07.5241709Z         T: int,
2025-05-07T20:33:07.5241959Z         D: int,
2025-05-07T20:33:07.5242268Z         scale_ub: Optional[float],
2025-05-07T20:33:07.5242629Z         contiguous: bool,
2025-05-07T20:33:07.5262631Z         compiled: bool,
2025-05-07T20:33:07.5262879Z     ) -> None:
2025-05-07T20:33:07.5263102Z         torch.manual_seed(2025)
2025-05-07T20:33:07.5263416Z
2025-05-07T20:33:07.5263758Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:07.5264111Z
2025-05-07T20:33:07.5264310Z         x_sign = torch.sign(x)
2025-05-07T20:33:07.5264601Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:07.5264918Z         x = x_sign * x_clamp
2025-05-07T20:33:07.5265173Z         x0 = x[:, :D]
2025-05-07T20:33:07.5265394Z         x1 = x[:, D:]
2025-05-07T20:33:07.5265605Z
2025-05-07T20:33:07.5265796Z         if contiguous:
2025-05-07T20:33:07.5266025Z             x0 = x0.contiguous()
2025-05-07T20:33:07.5266292Z             x1 = x1.contiguous()
2025-05-07T20:33:07.5266534Z
2025-05-07T20:33:07.5266721Z         if scale_ub is not None:
2025-05-07T20:33:07.5266999Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:07.5267336Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:07.5267649Z             )
2025-05-07T20:33:07.5267837Z         else:
2025-05-07T20:33:07.5268199Z             scale_ub_tensor = None
2025-05-07T20:33:07.5268452Z
2025-05-07T20:33:07.5268679Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:07.5268996Z             op = silu_mul_quant
2025-05-07T20:33:07.5269245Z             if compiled:
2025-05-07T20:33:07.5269488Z                 op = torch.compile(op)
2025-05-07T20:33:07.5269866Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:07.5270157Z
2025-05-07T20:33:07.5270343Z         y_fp8, y_scale = fn()
2025-05-07T20:33:07.5270629Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:07.5270923Z
2025-05-07T20:33:07.5271157Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:07.5271497Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:07.5271791Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:07.5272105Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:07.5272460Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:07.5272834Z
2025-05-07T20:33:07.5273037Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:07.5273233Z
2025-05-07T20:33:07.5273334Z moe/activation_test.py:126:
2025-05-07T20:33:07.5273638Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:07.5274026Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:07.5274347Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:07.5275139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:07.5275907Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:07.5276452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:07.5277124Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:07.5277813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:07.5278613Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:07.5280525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:33:07.5281395Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:07.5282184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:07.5282824Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:07.5283435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:07.5283958Z     fn()
2025-05-07T20:33:07.5284467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:07.5285055Z     self.fn.run(
2025-05-07T20:33:07.5285518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:07.5286053Z     kernel = self.compile(
2025-05-07T20:33:07.5286607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:07.5287275Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:07.5287680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:07.5287925Z
2025-05-07T20:33:07.5288135Z self =
2025-05-07T20:33:07.5289238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:07.5290706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8919e54160>}
2025-05-07T20:33:07.5292112Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:07.5293143Z context =
2025-05-07T20:33:07.5293433Z
2025-05-07T20:33:07.5293598Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:07.5294120Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:07.5294583Z                            module_map=module_map)
2025-05-07T20:33:07.5294950Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.5295306Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:07.5295619Z E       ^
2025-05-07T20:33:07.5296087Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.5296547Z
2025-05-07T20:33:07.5296967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:07.5297521Z
at 0x7f8919e54160>} 2025-05-07T20:33:07.5292112Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5293143Z context = 2025-05-07T20:33:07.5293433Z 2025-05-07T20:33:07.5293598Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5294120Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5294583Z module_map=module_map) 2025-05-07T20:33:07.5294950Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5295306Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5295619Z E ^ 2025-05-07T20:33:07.5296087Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5296547Z 2025-05-07T20:33:07.5296967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5297521Z 2025-05-07T20:33:07.5297635Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5298043Z self=, 2025-05-07T20:33:07.5298452Z T=16384, 2025-05-07T20:33:07.5298645Z D=7168, 2025-05-07T20:33:07.5298832Z scale_ub=1200.0, 2025-05-07T20:33:07.5299060Z contiguous=False, 2025-05-07T20:33:07.5299293Z compiled=False, 2025-05-07T20:33:07.5299493Z ) 2025-05-07T20:33:07.5299815Z self = 2025-05-07T20:33:07.5300321Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.5300608Z 2025-05-07T20:33:07.5300692Z @given( 2025-05-07T20:33:07.5300917Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5301237Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5301573Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5301980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5302322Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5302610Z ) 2025-05-07T20:33:07.5302960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5303407Z def test_silu_mul_quant( 2025-05-07T20:33:07.5303649Z self, 2025-05-07T20:33:07.5304125Z T: int, 2025-05-07T20:33:07.5304318Z D: int, 2025-05-07T20:33:07.5304534Z scale_ub: Optional[float], 2025-05-07T20:33:07.5304806Z contiguous: bool, 2025-05-07T20:33:07.5305042Z compiled: bool, 2025-05-07T20:33:07.5305276Z ) -> None: 2025-05-07T20:33:07.5305495Z torch.manual_seed(2025) 2025-05-07T20:33:07.5305734Z 2025-05-07T20:33:07.5306010Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5306357Z 2025-05-07T20:33:07.5306550Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5306855Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5307166Z x = x_sign * x_clamp 2025-05-07T20:33:07.5307399Z x0 = x[:, :D] 2025-05-07T20:33:07.5307623Z x1 = x[:, D:] 2025-05-07T20:33:07.5307840Z 2025-05-07T20:33:07.5308020Z if contiguous: 2025-05-07T20:33:07.5308252Z x0 = x0.contiguous() 2025-05-07T20:33:07.5308519Z x1 = x1.contiguous() 2025-05-07T20:33:07.5308756Z 2025-05-07T20:33:07.5308948Z if scale_ub is not None: 2025-05-07T20:33:07.5309221Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5309556Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5310077Z ) 2025-05-07T20:33:07.5310270Z else: 2025-05-07T20:33:07.5310482Z scale_ub_tensor = None 2025-05-07T20:33:07.5310726Z 2025-05-07T20:33:07.5310958Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5311272Z op = silu_mul_quant 2025-05-07T20:33:07.5311523Z if compiled: 
2025-05-07T20:33:07.5311769Z op = torch.compile(op) 2025-05-07T20:33:07.5312065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5312332Z 2025-05-07T20:33:07.5312523Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5312688Z 2025-05-07T20:33:07.5312793Z moe/activation_test.py:117: 2025-05-07T20:33:07.5313079Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5313408Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5313687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5314476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5315171Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5315708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5316390Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5317117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5317644Z kernel = self.compile( 2025-05-07T20:33:07.5318186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5318840Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5319230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5319463Z 2025-05-07T20:33:07.5319670Z self = 2025-05-07T20:33:07.5320759Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5322213Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8919dffe50>} 2025-05-07T20:33:07.5323565Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5324581Z context = 2025-05-07T20:33:07.5324870Z 2025-05-07T20:33:07.5325037Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5325564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5326037Z module_map=module_map) 2025-05-07T20:33:07.5326396Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5326749Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5327013Z E ^ 2025-05-07T20:33:07.5327476Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5327934Z 2025-05-07T20:33:07.5328357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5328877Z 2025-05-07T20:33:07.5328981Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5329399Z self=, 2025-05-07T20:33:07.5329793Z T=1, 2025-05-07T20:33:07.5329973Z D=7168, 2025-05-07T20:33:07.5330216Z scale_ub=None, 2025-05-07T20:33:07.5330424Z contiguous=True, 2025-05-07T20:33:07.5330645Z compiled=True, 2025-05-07T20:33:07.5330844Z ) 2025-05-07T20:33:07.5331156Z self = 2025-05-07T20:33:07.5331659Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.5331948Z 2025-05-07T20:33:07.5332029Z @given( 2025-05-07T20:33:07.5332254Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5332564Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5332868Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5333197Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5333521Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5333804Z ) 2025-05-07T20:33:07.5334148Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5334581Z def test_silu_mul_quant( 2025-05-07T20:33:07.5334871Z self, 2025-05-07T20:33:07.5335066Z T: int, 2025-05-07T20:33:07.5335255Z D: int, 2025-05-07T20:33:07.5335468Z scale_ub: Optional[float], 2025-05-07T20:33:07.5335738Z contiguous: bool, 2025-05-07T20:33:07.5335969Z compiled: bool, 2025-05-07T20:33:07.5336190Z ) -> None: 2025-05-07T20:33:07.5336454Z torch.manual_seed(2025) 2025-05-07T20:33:07.5336690Z 2025-05-07T20:33:07.5336956Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5337294Z 2025-05-07T20:33:07.5337485Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5337769Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5338076Z x = x_sign * x_clamp 2025-05-07T20:33:07.5338320Z x0 = x[:, :D] 2025-05-07T20:33:07.5338530Z x1 = x[:, D:] 2025-05-07T20:33:07.5338734Z 2025-05-07T20:33:07.5338924Z if contiguous: 2025-05-07T20:33:07.5339150Z x0 = x0.contiguous() 2025-05-07T20:33:07.5339417Z x1 = x1.contiguous() 2025-05-07T20:33:07.5339659Z 2025-05-07T20:33:07.5339845Z if scale_ub is not None: 2025-05-07T20:33:07.5340121Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5340461Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5340843Z ) 2025-05-07T20:33:07.5341042Z else: 2025-05-07T20:33:07.5341254Z scale_ub_tensor = None 2025-05-07T20:33:07.5341503Z 2025-05-07T20:33:07.5341772Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5342099Z op = silu_mul_quant 2025-05-07T20:33:07.5342346Z if compiled: 2025-05-07T20:33:07.5342597Z op = torch.compile(op) 2025-05-07T20:33:07.5343171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5343525Z 2025-05-07T20:33:07.5343836Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.5344262Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.5344631Z 2025-05-07T20:33:07.5344975Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5345437Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.5345799Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.5346228Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.5346718Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5347155Z 2025-05-07T20:33:07.5347408Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.5347657Z 2025-05-07T20:33:07.5347838Z moe/activation_test.py:126: 2025-05-07T20:33:07.5348263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5348650Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.5349121Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5350239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.5351293Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5352105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5353140Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5354116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.5355090Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5355899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.5356738Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5357706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.5358435Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5359089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.5359773Z fn() 2025-05-07T20:33:07.5360408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.5361063Z self.fn.run( 2025-05-07T20:33:07.5361716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5362349Z kernel = self.compile( 2025-05-07T20:33:07.5362989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5363771Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5364262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5364579Z 2025-05-07T20:33:07.5364800Z self = 2025-05-07T20:33:07.5366108Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5367582Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f8919dff550>} 2025-05-07T20:33:07.5369019Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5370220Z context = 2025-05-07T20:33:07.5370533Z 2025-05-07T20:33:07.5370793Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5371445Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5371979Z module_map=module_map) 2025-05-07T20:33:07.5372451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5372941Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5373275Z E ^ 2025-05-07T20:33:07.5373842Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5374341Z 2025-05-07T20:33:07.5374855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5375411Z 2025-05-07T20:33:07.5375592Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5376140Z self=, 2025-05-07T20:33:07.5376743Z T=4096, 2025-05-07T20:33:07.5377065Z D=5120, 2025-05-07T20:33:07.5377340Z scale_ub=None, 2025-05-07T20:33:07.5377662Z contiguous=False, 2025-05-07T20:33:07.5378021Z compiled=False, 2025-05-07T20:33:07.5378325Z ) 2025-05-07T20:33:07.5378741Z self = 2025-05-07T20:33:07.5379371Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.5379674Z 2025-05-07T20:33:07.5379815Z @given( 2025-05-07T20:33:07.5380159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5380578Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5381050Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5381555Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5381972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5382369Z ) 2025-05-07T20:33:07.5382933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5383439Z def test_silu_mul_quant( 2025-05-07T20:33:07.5383799Z self, 2025-05-07T20:33:07.5384148Z T: int, 2025-05-07T20:33:07.5384392Z D: int, 2025-05-07T20:33:07.5384719Z scale_ub: Optional[float], 2025-05-07T20:33:07.5385204Z contiguous: bool, 2025-05-07T20:33:07.5385623Z compiled: bool, 2025-05-07T20:33:07.5385895Z ) -> None: 2025-05-07T20:33:07.5386261Z torch.manual_seed(2025) 2025-05-07T20:33:07.5386609Z 2025-05-07T20:33:07.5386938Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5387434Z 2025-05-07T20:33:07.5387733Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5388080Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5388536Z x = x_sign * x_clamp 2025-05-07T20:33:07.5388888Z x0 = x[:, :D] 2025-05-07T20:33:07.5389153Z x1 = x[:, D:] 2025-05-07T20:33:07.5389506Z 2025-05-07T20:33:07.5389933Z if contiguous: 2025-05-07T20:33:07.5390234Z x0 = x0.contiguous() 2025-05-07T20:33:07.5390629Z x1 = x1.contiguous() 2025-05-07T20:33:07.5390979Z 2025-05-07T20:33:07.5391245Z if scale_ub is not None: 2025-05-07T20:33:07.5391663Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5392194Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5392579Z ) 2025-05-07T20:33:07.5392918Z else: 2025-05-07T20:33:07.5393218Z scale_ub_tensor = None 2025-05-07T20:33:07.5393547Z 2025-05-07T20:33:07.5393998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5394402Z op = silu_mul_quant 2025-05-07T20:33:07.5394724Z if compiled: 
2025-05-07T20:33:07.5395164Z op = torch.compile(op) 2025-05-07T20:33:07.5395517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5395888Z 2025-05-07T20:33:07.5396274Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5396468Z 2025-05-07T20:33:07.5396594Z moe/activation_test.py:117: 2025-05-07T20:33:07.5396984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5397478Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5397853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5398622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5399561Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5400210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5400931Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5401806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5402456Z kernel = self.compile( 2025-05-07T20:33:07.5403181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5404159Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5404682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5404962Z 2025-05-07T20:33:07.5405268Z self = 2025-05-07T20:33:07.5406495Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5407941Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8919a23940>} 2025-05-07T20:33:07.5409591Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5410756Z context = 2025-05-07T20:33:07.5411108Z 2025-05-07T20:33:07.5411375Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5412141Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5412664Z module_map=module_map) 2025-05-07T20:33:07.5413115Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5413630Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5413947Z E ^ 2025-05-07T20:33:07.5414501Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5415076Z 2025-05-07T20:33:07.5415572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5416116Z 2025-05-07T20:33:07.5416280Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5416816Z self=, 2025-05-07T20:33:07.5417460Z T=4096, 2025-05-07T20:33:07.5417745Z D=7168, 2025-05-07T20:33:07.5418028Z scale_ub=None, 2025-05-07T20:33:07.5418369Z contiguous=False, 2025-05-07T20:33:07.5418691Z compiled=False, 2025-05-07T20:33:07.5418987Z ) 2025-05-07T20:33:07.5419426Z self = 2025-05-07T20:33:07.5420011Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.5420338Z 2025-05-07T20:33:07.5420529Z @given( 2025-05-07T20:33:07.5420925Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5421336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5421820Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5422381Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5422847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5423354Z ) 2025-05-07T20:33:07.5423949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5424554Z def test_silu_mul_quant( 2025-05-07T20:33:07.5424876Z self, 2025-05-07T20:33:07.5425276Z T: int, 2025-05-07T20:33:07.5425546Z D: int, 2025-05-07T20:33:07.5425895Z scale_ub: Optional[float], 2025-05-07T20:33:07.5426375Z contiguous: bool, 2025-05-07T20:33:07.5426795Z compiled: bool, 2025-05-07T20:33:07.5427113Z ) -> None: 2025-05-07T20:33:07.5427623Z torch.manual_seed(2025) 2025-05-07T20:33:07.5428019Z 2025-05-07T20:33:07.5428381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5429031Z 2025-05-07T20:33:07.5429315Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5429639Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5430211Z x = x_sign * x_clamp 2025-05-07T20:33:07.5430544Z x0 = x[:, :D] 2025-05-07T20:33:07.5430902Z x1 = x[:, D:] 2025-05-07T20:33:07.5431205Z 2025-05-07T20:33:07.5431583Z if contiguous: 2025-05-07T20:33:07.5431975Z x0 = x0.contiguous() 2025-05-07T20:33:07.5432321Z x1 = x1.contiguous() 2025-05-07T20:33:07.5432652Z 2025-05-07T20:33:07.5432979Z if scale_ub is not None: 2025-05-07T20:33:07.5433319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5433746Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5439767Z ) 2025-05-07T20:33:07.5439999Z else: 2025-05-07T20:33:07.5440221Z scale_ub_tensor = None 2025-05-07T20:33:07.5440474Z 2025-05-07T20:33:07.5440794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5441128Z op = silu_mul_quant 2025-05-07T20:33:07.5441382Z if compiled: 2025-05-07T20:33:07.5441635Z op = torch.compile(op) 2025-05-07T20:33:07.5441990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5442260Z 2025-05-07T20:33:07.5442505Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5442671Z 2025-05-07T20:33:07.5442775Z moe/activation_test.py:117: 2025-05-07T20:33:07.5443075Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5443414Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5443699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5444403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5445097Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5445643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5446333Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5447004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5447535Z kernel = self.compile( 2025-05-07T20:33:07.5448137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5448791Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5449184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5449417Z 2025-05-07T20:33:07.5449621Z self = 2025-05-07T20:33:07.5450717Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5452119Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89199e65e0>} 2025-05-07T20:33:07.5453480Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5454511Z context = 2025-05-07T20:33:07.5454806Z 2025-05-07T20:33:07.5454968Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5455494Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5455964Z module_map=module_map) 2025-05-07T20:33:07.5456328Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5456728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5456983Z E ^ 2025-05-07T20:33:07.5457451Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5457911Z 2025-05-07T20:33:07.5458334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5458854Z 2025-05-07T20:33:07.5458955Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5459374Z self=, 2025-05-07T20:33:07.5459774Z T=128, 2025-05-07T20:33:07.5459960Z D=7168, 2025-05-07T20:33:07.5460155Z scale_ub=None, 2025-05-07T20:33:07.5460371Z contiguous=False, 2025-05-07T20:33:07.5460596Z compiled=True, 2025-05-07T20:33:07.5460797Z ) 2025-05-07T20:33:07.5461117Z self = 2025-05-07T20:33:07.5461655Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.5461923Z 2025-05-07T20:33:07.5462010Z @given( 2025-05-07T20:33:07.5462237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5462549Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5462907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5463238Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5463556Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5463838Z ) 2025-05-07T20:33:07.5464197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5464631Z def test_silu_mul_quant( 2025-05-07T20:33:07.5464866Z self, 2025-05-07T20:33:07.5465051Z T: int, 2025-05-07T20:33:07.5465236Z D: int, 2025-05-07T20:33:07.5465449Z scale_ub: Optional[float], 2025-05-07T20:33:07.5465721Z contiguous: bool, 2025-05-07T20:33:07.5465961Z compiled: bool, 2025-05-07T20:33:07.5466173Z ) -> None: 2025-05-07T20:33:07.5466387Z torch.manual_seed(2025) 2025-05-07T20:33:07.5466625Z 2025-05-07T20:33:07.5466892Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5467232Z 2025-05-07T20:33:07.5467509Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5467793Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5468102Z x = x_sign * x_clamp 2025-05-07T20:33:07.5468338Z x0 = x[:, :D] 2025-05-07T20:33:07.5468543Z x1 = x[:, D:] 2025-05-07T20:33:07.5468744Z 2025-05-07T20:33:07.5468922Z if contiguous: 2025-05-07T20:33:07.5469141Z x0 = x0.contiguous() 2025-05-07T20:33:07.5469392Z x1 = x1.contiguous() 2025-05-07T20:33:07.5469627Z 2025-05-07T20:33:07.5469874Z if scale_ub is not None: 2025-05-07T20:33:07.5470146Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5470487Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5470785Z ) 2025-05-07T20:33:07.5470973Z else: 2025-05-07T20:33:07.5471177Z scale_ub_tensor = None 2025-05-07T20:33:07.5471415Z 2025-05-07T20:33:07.5471649Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5471962Z op = silu_mul_quant 2025-05-07T20:33:07.5472202Z if compiled: 2025-05-07T20:33:07.5472451Z op = torch.compile(op) 2025-05-07T20:33:07.5472745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5473014Z 2025-05-07T20:33:07.5473196Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.5473478Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.5473763Z 2025-05-07T20:33:07.5473985Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5474311Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.5474599Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.5474966Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.5475317Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5475624Z 2025-05-07T20:33:07.5475810Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.5476011Z 2025-05-07T20:33:07.5476111Z moe/activation_test.py:126: 2025-05-07T20:33:07.5476403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5476729Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.5477044Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5477840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.5478605Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5479186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5479865Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5480556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.5481274Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5482109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.5483029Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5483757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.5484398Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5484999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.5485518Z fn() 2025-05-07T20:33:07.5486017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.5486590Z self.fn.run( 2025-05-07T20:33:07.5487100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5487628Z kernel = self.compile( 2025-05-07T20:33:07.5488163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5488810Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5489215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5489439Z 2025-05-07T20:33:07.5489644Z self = 2025-05-07T20:33:07.5490733Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5492231Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f8919e54310>} 2025-05-07T20:33:07.5493587Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5494661Z context = 2025-05-07T20:33:07.5494944Z 2025-05-07T20:33:07.5495155Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5495897Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5496562Z module_map=module_map) 2025-05-07T20:33:07.5497065Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5497566Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5497917Z E ^ 2025-05-07T20:33:07.5498518Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5499113Z 2025-05-07T20:33:07.5499664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5500374Z 2025-05-07T20:33:07.5500524Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5501047Z self=, 2025-05-07T20:33:07.5501611Z T=128, 2025-05-07T20:33:07.5501873Z D=7168, 2025-05-07T20:33:07.5502082Z scale_ub=None, 2025-05-07T20:33:07.5502385Z contiguous=False, 2025-05-07T20:33:07.5502694Z compiled=False, 2025-05-07T20:33:07.5502979Z ) 2025-05-07T20:33:07.5503517Z self = 2025-05-07T20:33:07.5504411Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.5504784Z 2025-05-07T20:33:07.5504888Z @given( 2025-05-07T20:33:07.5505208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5505776Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5506192Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5506592Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5506916Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5507200Z ) 2025-05-07T20:33:07.5507541Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5507986Z def test_silu_mul_quant( 2025-05-07T20:33:07.5508226Z self, 2025-05-07T20:33:07.5508409Z T: int, 2025-05-07T20:33:07.5508602Z D: int, 2025-05-07T20:33:07.5508820Z scale_ub: Optional[float], 2025-05-07T20:33:07.5509081Z contiguous: bool, 2025-05-07T20:33:07.5509314Z compiled: bool, 2025-05-07T20:33:07.5509534Z ) -> None: 2025-05-07T20:33:07.5509739Z torch.manual_seed(2025) 2025-05-07T20:33:07.5510047Z 2025-05-07T20:33:07.5510404Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5510753Z 2025-05-07T20:33:07.5510942Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5511227Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5511540Z x = x_sign * x_clamp 2025-05-07T20:33:07.5511770Z x0 = x[:, :D] 2025-05-07T20:33:07.5511982Z x1 = x[:, D:] 2025-05-07T20:33:07.5512189Z 2025-05-07T20:33:07.5512362Z if contiguous: 2025-05-07T20:33:07.5512588Z x0 = x0.contiguous() 2025-05-07T20:33:07.5512842Z x1 = x1.contiguous() 2025-05-07T20:33:07.5513072Z 2025-05-07T20:33:07.5513261Z if scale_ub is not None: 2025-05-07T20:33:07.5513530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5513857Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5514163Z ) 2025-05-07T20:33:07.5514350Z else: 2025-05-07T20:33:07.5514552Z scale_ub_tensor = None 2025-05-07T20:33:07.5514804Z 2025-05-07T20:33:07.5515035Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5515336Z op = silu_mul_quant 2025-05-07T20:33:07.5515590Z if compiled: 
2025-05-07T20:33:07.5515837Z op = torch.compile(op) 2025-05-07T20:33:07.5516125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5516410Z 2025-05-07T20:33:07.5516599Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5516761Z 2025-05-07T20:33:07.5516860Z moe/activation_test.py:117: 2025-05-07T20:33:07.5517154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5517564Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5517835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5518526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5519219Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5519763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5520436Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5521093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5521621Z kernel = self.compile( 2025-05-07T20:33:07.5522159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5522880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5523278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5523503Z 2025-05-07T20:33:07.5523712Z self = 2025-05-07T20:33:07.5524797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5526237Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891957e160>} 2025-05-07T20:33:07.5527590Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5528620Z context = 2025-05-07T20:33:07.5528906Z 2025-05-07T20:33:07.5529081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5529597Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5530106Z module_map=module_map) 2025-05-07T20:33:07.5530477Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5530825Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5531080Z E ^ 2025-05-07T20:33:07.5531537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5531989Z 2025-05-07T20:33:07.5532407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5532918Z 2025-05-07T20:33:07.5533018Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5533432Z self=, 2025-05-07T20:33:07.5533827Z T=4096, 2025-05-07T20:33:07.5534005Z D=5120, 2025-05-07T20:33:07.5534192Z scale_ub=1200.0, 2025-05-07T20:33:07.5534409Z contiguous=True, 2025-05-07T20:33:07.5534638Z compiled=False, 2025-05-07T20:33:07.5534845Z ) 2025-05-07T20:33:07.5535161Z self = 2025-05-07T20:33:07.5535749Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.5536027Z 2025-05-07T20:33:07.5536102Z @given( 2025-05-07T20:33:07.5536324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5536633Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5536929Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5537255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5537580Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5537924Z ) 2025-05-07T20:33:07.5538273Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5538715Z def test_silu_mul_quant( 2025-05-07T20:33:07.5538957Z self, 2025-05-07T20:33:07.5539146Z T: int, 2025-05-07T20:33:07.5539344Z D: int, 2025-05-07T20:33:07.5539567Z scale_ub: Optional[float], 2025-05-07T20:33:07.5539834Z contiguous: bool, 2025-05-07T20:33:07.5540075Z compiled: bool, 2025-05-07T20:33:07.5540298Z ) -> None: 2025-05-07T20:33:07.5540509Z torch.manual_seed(2025) 2025-05-07T20:33:07.5540750Z 2025-05-07T20:33:07.5541018Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5541359Z 2025-05-07T20:33:07.5541558Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5541848Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5542152Z x = x_sign * x_clamp 2025-05-07T20:33:07.5542450Z x0 = x[:, :D] 2025-05-07T20:33:07.5542667Z x1 = x[:, D:] 2025-05-07T20:33:07.5542868Z 2025-05-07T20:33:07.5543060Z if contiguous: 2025-05-07T20:33:07.5543288Z x0 = x0.contiguous() 2025-05-07T20:33:07.5543541Z x1 = x1.contiguous() 2025-05-07T20:33:07.5543782Z 2025-05-07T20:33:07.5544028Z if scale_ub is not None: 2025-05-07T20:33:07.5544302Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5544631Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5544940Z ) 2025-05-07T20:33:07.5545133Z else: 2025-05-07T20:33:07.5545342Z scale_ub_tensor = None 2025-05-07T20:33:07.5545595Z 2025-05-07T20:33:07.5545823Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5546125Z op = silu_mul_quant 2025-05-07T20:33:07.5546376Z if compiled: 2025-05-07T20:33:07.5546623Z op = torch.compile(op) 2025-05-07T20:33:07.5546919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5547193Z 2025-05-07T20:33:07.5547379Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5547541Z 2025-05-07T20:33:07.5547640Z moe/activation_test.py:117: 2025-05-07T20:33:07.5547933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5548385Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5548663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5549342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5550095Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5550627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5551300Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5552009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5552536Z kernel = self.compile( 2025-05-07T20:33:07.5552912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5553094Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5553220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5553225Z 2025-05-07T20:33:07.5553426Z self = 2025-05-07T20:33:07.5554213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5554723Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89194f75e0>} 2025-05-07T20:33:07.5555526Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5555716Z context = 2025-05-07T20:33:07.5555724Z 2025-05-07T20:33:07.5555894Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5556155Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5556260Z module_map=module_map) 2025-05-07T20:33:07.5556429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5556523Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5556596Z E ^ 2025-05-07T20:33:07.5556993Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5557000Z 2025-05-07T20:33:07.5557413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5557418Z 2025-05-07T20:33:07.5557520Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5557791Z self=, 2025-05-07T20:33:07.5557868Z T=1, 2025-05-07T20:33:07.5557949Z D=5120, 2025-05-07T20:33:07.5558028Z scale_ub=None, 2025-05-07T20:33:07.5558110Z contiguous=True, 2025-05-07T20:33:07.5558195Z compiled=True, 2025-05-07T20:33:07.5558263Z ) 2025-05-07T20:33:07.5558489Z self = 2025-05-07T20:33:07.5558651Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.5558656Z 2025-05-07T20:33:07.5558729Z @given( 2025-05-07T20:33:07.5558860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5558960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5559070Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5559191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5559301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5559417Z ) 2025-05-07T20:33:07.5559667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5559759Z def test_silu_mul_quant( 2025-05-07T20:33:07.5559841Z self, 2025-05-07T20:33:07.5559914Z T: int, 2025-05-07T20:33:07.5559986Z D: int, 2025-05-07T20:33:07.5560085Z scale_ub: Optional[float], 2025-05-07T20:33:07.5560171Z contiguous: bool, 2025-05-07T20:33:07.5560255Z compiled: bool, 2025-05-07T20:33:07.5560335Z ) -> None: 2025-05-07T20:33:07.5560433Z torch.manual_seed(2025) 2025-05-07T20:33:07.5560501Z 2025-05-07T20:33:07.5560677Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5560753Z 2025-05-07T20:33:07.5560840Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5560969Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5561055Z x = x_sign * x_clamp 2025-05-07T20:33:07.5561141Z x0 = x[:, :D] 2025-05-07T20:33:07.5561225Z x1 = x[:, D:] 2025-05-07T20:33:07.5561294Z 2025-05-07T20:33:07.5561380Z if contiguous: 2025-05-07T20:33:07.5561469Z x0 = x0.contiguous() 2025-05-07T20:33:07.5561557Z x1 = x1.contiguous() 2025-05-07T20:33:07.5561631Z 2025-05-07T20:33:07.5561719Z if scale_ub is not None: 2025-05-07T20:33:07.5561848Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5562029Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5562105Z ) 2025-05-07T20:33:07.5562180Z else: 2025-05-07T20:33:07.5562299Z scale_ub_tensor = None 2025-05-07T20:33:07.5562433Z 2025-05-07T20:33:07.5562563Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5562657Z op = silu_mul_quant 2025-05-07T20:33:07.5562741Z if compiled: 2025-05-07T20:33:07.5562847Z op = torch.compile(op) 2025-05-07T20:33:07.5562952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5563023Z 2025-05-07T20:33:07.5563114Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.5563232Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.5563304Z 2025-05-07T20:33:07.5563443Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5563542Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.5563639Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.5563764Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.5563902Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5563981Z 2025-05-07T20:33:07.5564152Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.5564158Z 2025-05-07T20:33:07.5564256Z moe/activation_test.py:126: 2025-05-07T20:33:07.5564385Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5564487Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.5564666Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5565230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.5565327Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5565707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5565935Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5566353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.5566617Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5567107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.5567472Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5567850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.5568036Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5568431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.5568507Z fn() 2025-05-07T20:33:07.5568977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.5569069Z self.fn.run( 2025-05-07T20:33:07.5569438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5569559Z kernel = self.compile( 2025-05-07T20:33:07.5569943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5570119Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5570250Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5570255Z 2025-05-07T20:33:07.5570457Z self = 2025-05-07T20:33:07.5571251Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5571765Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f891928c5e0>} 2025-05-07T20:33:07.5572575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5572768Z context = 2025-05-07T20:33:07.5572773Z 2025-05-07T20:33:07.5572935Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5573201Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5573304Z module_map=module_map) 2025-05-07T20:33:07.5573463Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5573572Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5573644Z E ^ 2025-05-07T20:33:07.5574047Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5574060Z 2025-05-07T20:33:07.5574475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5574517Z 2025-05-07T20:33:07.5574620Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5574852Z self=, 2025-05-07T20:33:07.5574928Z T=2048, 2025-05-07T20:33:07.5575002Z D=5120, 2025-05-07T20:33:07.5575086Z scale_ub=None, 2025-05-07T20:33:07.5575171Z contiguous=True, 2025-05-07T20:33:07.5575250Z compiled=True, 2025-05-07T20:33:07.5575323Z ) 2025-05-07T20:33:07.5575539Z self = 2025-05-07T20:33:07.5575717Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.5575725Z 2025-05-07T20:33:07.5575804Z @given( 2025-05-07T20:33:07.5575922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5576021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5576134Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5576297Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5576416Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5576487Z ) 2025-05-07T20:33:07.5576733Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5576829Z def test_silu_mul_quant( 2025-05-07T20:33:07.5576901Z self, 2025-05-07T20:33:07.5576982Z T: int, 2025-05-07T20:33:07.5577055Z D: int, 2025-05-07T20:33:07.5577150Z scale_ub: Optional[float], 2025-05-07T20:33:07.5577241Z contiguous: bool, 2025-05-07T20:33:07.5577323Z compiled: bool, 2025-05-07T20:33:07.5577398Z ) -> None: 2025-05-07T20:33:07.5577504Z torch.manual_seed(2025) 2025-05-07T20:33:07.5577577Z 2025-05-07T20:33:07.5577750Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5577826Z 2025-05-07T20:33:07.5577917Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5578038Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5578133Z x = x_sign * x_clamp 2025-05-07T20:33:07.5578211Z x0 = x[:, :D] 2025-05-07T20:33:07.5578293Z x1 = x[:, D:] 2025-05-07T20:33:07.5578363Z 2025-05-07T20:33:07.5578449Z if contiguous: 2025-05-07T20:33:07.5578542Z x0 = x0.contiguous() 2025-05-07T20:33:07.5578628Z x1 = x1.contiguous() 2025-05-07T20:33:07.5578699Z 2025-05-07T20:33:07.5578793Z if scale_ub is not None: 2025-05-07T20:33:07.5578896Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5579026Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5579110Z ) 2025-05-07T20:33:07.5579236Z else: 2025-05-07T20:33:07.5579328Z scale_ub_tensor = None 2025-05-07T20:33:07.5579410Z 2025-05-07T20:33:07.5579536Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5579623Z op = silu_mul_quant 2025-05-07T20:33:07.5579708Z if compiled: 
2025-05-07T20:33:07.5579811Z op = torch.compile(op) 2025-05-07T20:33:07.5579919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5579991Z 2025-05-07T20:33:07.5580081Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.5580203Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.5580269Z 2025-05-07T20:33:07.5580401Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5580502Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.5580601Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.5580720Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.5580914Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5580991Z 2025-05-07T20:33:07.5581096Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.5581100Z 2025-05-07T20:33:07.5581196Z moe/activation_test.py:126: 2025-05-07T20:33:07.5581325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5581477Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.5581609Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5582183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.5582289Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5582649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5582875Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5583251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.5583512Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5583955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.5584211Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5584588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.5584978Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5590313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.5590416Z fn() 2025-05-07T20:33:07.5590855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.5590951Z self.fn.run( 2025-05-07T20:33:07.5591296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5591398Z kernel = self.compile( 2025-05-07T20:33:07.5591789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5591965Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5592103Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5592109Z 2025-05-07T20:33:07.5592312Z self = 2025-05-07T20:33:07.5593104Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> fails at `y_fp8_ref, y_scale_ref = ref_fn()` (moe/activation_test.py:126) with the same
     CompilationError in _kernel_quantize_fp8_row; test source and traceback identical to the
     T=2048 example above.

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError in _kernel_quantize_fp8_row via ref_fn.

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError in _kernel_quantize_fp8_row via ref_fn.

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
  -> fails earlier, at `y_fp8, y_scale = fn()` (moe/activation_test.py:117): the torch.compile'd
     silu_mul_quant (through torch/_dynamo/eval_frame.py:678) reaches
     fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, which launches
     _fbgemm_silu_mul_quant[grid]; its compile (CUDAOptions(..., num_stages=3, ...)) raises:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
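For context on what both failing paths compute: the test checks dequantization as y_fp8.to(torch.float32) * y_scale[:, None], so the quantizer must return a rowwise FP8 tensor plus per-row dequant scales. A rough pure-PyTorch sketch of those rowwise semantics, assuming E4M3 output and illustrative eps/scale_ub handling rather than FBGEMM's actual kernel details:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


    def rowwise_quantize_fp8_sketch(
        y: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        eps: float = 1e-12,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absmax scaling; the returned scale is the dequant multiplier,
        # matching the test's check y_fp8.to(torch.float32) * y_scale[:, None].
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows (assumed)
        scale = torch.clamp(row_max, min=eps) / FP8_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Unlike the Triton kernels, this casts through PyTorch's own float8_e4m3fn conversion, so it should run even on GPUs where Triton refuses to lower fp8e4nv.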
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> fails at `y_fp8_ref, y_scale_ref = ref_fn()` with the same CompilationError in
     _kernel_quantize_fp8_row.

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> fails at `y_fp8, y_scale = fn()`: with compiled=False the eager silu_mul_quant
     (activation.py:80) launches _fbgemm_silu_mul_quant[grid] directly, and its compile fails
     with the same CompilationError.

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> fails at `y_fp8, y_scale = fn()` with the same CompilationError in _fbgemm_silu_mul_quant,
     reached through the torch.compile'd path. A minimal repro of this compile-time failure is
     sketched below.
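Both kernels, _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant, fail identically because the ValueError comes from the dtype itself during make_ir, not from either kernel's logic. An untested minimal sketch that should reproduce the same CompilationError on this hardware; _cast_to_fp8e4nv is illustrative only:

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(X, Y, BLOCK: tl.constexpr):
        offs = tl.arange(0, BLOCK)
        x = tl.load(X + offs)
        # The cast to tl.float8e4nv is what ast_to_ttir rejects on SM < 8.9.
        tl.store(Y + offs, x.to(tl.float8e4nv))


    x = torch.randn(16, device="cuda", dtype=torch.float32)
    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, BLOCK=16)  # CompilationError on e.g. an A10G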
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5700471Z 2025-05-07T20:33:07.5700885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5700890Z 2025-05-07T20:33:07.5700989Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5701209Z self=, 2025-05-07T20:33:07.5701324Z T=128, 2025-05-07T20:33:07.5701398Z D=7168, 2025-05-07T20:33:07.5701482Z scale_ub=1200.0, 2025-05-07T20:33:07.5701565Z contiguous=False, 2025-05-07T20:33:07.5701668Z compiled=False, 2025-05-07T20:33:07.5701745Z ) 2025-05-07T20:33:07.5701990Z self = 2025-05-07T20:33:07.5702158Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.5702163Z 2025-05-07T20:33:07.5702239Z @given( 2025-05-07T20:33:07.5702353Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5702452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5702568Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5702680Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5702793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5702863Z ) 2025-05-07T20:33:07.5703149Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5703246Z def test_silu_mul_quant( 2025-05-07T20:33:07.5703320Z self, 2025-05-07T20:33:07.5703391Z T: int, 2025-05-07T20:33:07.5703468Z D: int, 2025-05-07T20:33:07.5703564Z scale_ub: Optional[float], 2025-05-07T20:33:07.5703649Z contiguous: bool, 2025-05-07T20:33:07.5703945Z compiled: bool, 2025-05-07T20:33:07.5704056Z ) -> None: 2025-05-07T20:33:07.5704163Z torch.manual_seed(2025) 2025-05-07T20:33:07.5704235Z 2025-05-07T20:33:07.5704409Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5704494Z 2025-05-07T20:33:07.5704586Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5704707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5704794Z x = x_sign * x_clamp 2025-05-07T20:33:07.5704870Z x0 = x[:, :D] 2025-05-07T20:33:07.5704953Z x1 = x[:, D:] 2025-05-07T20:33:07.5705028Z 2025-05-07T20:33:07.5705105Z if contiguous: 2025-05-07T20:33:07.5705192Z x0 = x0.contiguous() 2025-05-07T20:33:07.5705282Z x1 = x1.contiguous() 2025-05-07T20:33:07.5705353Z 2025-05-07T20:33:07.5705438Z if scale_ub is not None: 2025-05-07T20:33:07.5705551Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5705682Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5705757Z ) 2025-05-07T20:33:07.5705833Z else: 2025-05-07T20:33:07.5705923Z scale_ub_tensor = None 2025-05-07T20:33:07.5706082Z 2025-05-07T20:33:07.5706207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5706294Z op = silu_mul_quant 2025-05-07T20:33:07.5706381Z if compiled: 2025-05-07T20:33:07.5706484Z op = torch.compile(op) 2025-05-07T20:33:07.5706590Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5706665Z 2025-05-07T20:33:07.5706754Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5706758Z 2025-05-07T20:33:07.5706853Z moe/activation_test.py:117: 2025-05-07T20:33:07.5706985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5707082Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5707189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5707697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5707796Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5708221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5708445Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5708788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5708943Z kernel = self.compile( 2025-05-07T20:33:07.5709329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5709509Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5709635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5709639Z 2025-05-07T20:33:07.5709917Z self = 2025-05-07T20:33:07.5710710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5711303Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917cbdd30>} 2025-05-07T20:33:07.5712056Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5712250Z context = 2025-05-07T20:33:07.5712255Z 2025-05-07T20:33:07.5712424Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5712686Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5712794Z module_map=module_map) 2025-05-07T20:33:07.5712961Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5713060Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5713135Z E ^ 2025-05-07T20:33:07.5713494Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5713501Z 2025-05-07T20:33:07.5713916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5713920Z 2025-05-07T20:33:07.5714066Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5714323Z self=, 2025-05-07T20:33:07.5714436Z T=128, 2025-05-07T20:33:07.5718668Z D=5120, 2025-05-07T20:33:07.5718772Z scale_ub=None, 2025-05-07T20:33:07.5718861Z contiguous=False, 2025-05-07T20:33:07.5718948Z compiled=False, 2025-05-07T20:33:07.5719021Z ) 2025-05-07T20:33:07.5719323Z self = 2025-05-07T20:33:07.5719501Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.5719506Z 2025-05-07T20:33:07.5719581Z @given( 2025-05-07T20:33:07.5719703Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5719812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5719927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5720041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5720156Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5720230Z ) 2025-05-07T20:33:07.5720481Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5720575Z def test_silu_mul_quant( 2025-05-07T20:33:07.5720651Z self, 2025-05-07T20:33:07.5720728Z T: int, 2025-05-07T20:33:07.5720801Z D: int, 2025-05-07T20:33:07.5720944Z scale_ub: Optional[float], 2025-05-07T20:33:07.5721037Z contiguous: bool, 2025-05-07T20:33:07.5721122Z compiled: bool, 2025-05-07T20:33:07.5721200Z ) -> None: 2025-05-07T20:33:07.5721298Z torch.manual_seed(2025) 2025-05-07T20:33:07.5721370Z 2025-05-07T20:33:07.5721543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5721662Z 2025-05-07T20:33:07.5721753Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5721878Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5721963Z x = x_sign * x_clamp 2025-05-07T20:33:07.5722041Z x0 = x[:, :D] 2025-05-07T20:33:07.5722124Z x1 = x[:, D:] 2025-05-07T20:33:07.5722194Z 2025-05-07T20:33:07.5722276Z if contiguous: 2025-05-07T20:33:07.5722371Z x0 = x0.contiguous() 2025-05-07T20:33:07.5722458Z x1 = x1.contiguous() 2025-05-07T20:33:07.5722529Z 2025-05-07T20:33:07.5722623Z if scale_ub is not None: 2025-05-07T20:33:07.5722734Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5722871Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5722952Z ) 2025-05-07T20:33:07.5723028Z else: 2025-05-07T20:33:07.5723124Z scale_ub_tensor = None 2025-05-07T20:33:07.5723197Z 2025-05-07T20:33:07.5723367Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5723459Z op = silu_mul_quant 2025-05-07T20:33:07.5723542Z if compiled: 2025-05-07T20:33:07.5723642Z op = torch.compile(op) 2025-05-07T20:33:07.5723750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5723824Z 2025-05-07T20:33:07.5723913Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5723917Z 2025-05-07T20:33:07.5724021Z moe/activation_test.py:117: 2025-05-07T20:33:07.5724149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5724258Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5724359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5724863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5724965Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5725331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5725554Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5725895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5725987Z kernel = self.compile( 2025-05-07T20:33:07.5726368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5726544Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5726715Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5726720Z 2025-05-07T20:33:07.5726928Z self = 2025-05-07T20:33:07.5727712Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5728223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c82310>} 2025-05-07T20:33:07.5728967Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5729193Z context = 2025-05-07T20:33:07.5729201Z 2025-05-07T20:33:07.5729371Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5729633Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5729743Z module_map=module_map) 2025-05-07T20:33:07.5729945Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5730044Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5730123Z E ^ 2025-05-07T20:33:07.5730486Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:07.5731008Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[test source and traceback identical to the example above; the launch of _fbgemm_silu_mul_quant fails with the same fp8e4nv CompilationError]
2025-05-07T20:33:07.5743674Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[identical failure; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 before reaching activation.py:80]
2025-05-07T20:33:07.5756802Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[identical failure]
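Because hypothesis only varies T, D, scale_ub, contiguous, and compiled, and the error is raised at kernel-compilation time, every sampled combination hits the same CompilationError. A hedged sketch for reproducing one failing example deterministically, without hypothesis; it assumes silu_mul_quant is importable from the module path shown in the traceback:

    from typing import Optional

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    torch.manual_seed(2025)
    T, D = 128, 5120  # any sampled pair reproduces the error
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    x0, x1 = x[:, :D], x[:, D:]
    scale_ub_tensor: Optional[torch.Tensor] = None
    # On an SM 8.6 GPU this raises the fp8e4nv CompilationError at launch time.
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub_tensor)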
2025-05-07T20:33:07.5769787Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:07.5770892Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[test source identical to the listing above; this example proceeds past fn() and fails in the reference path instead]
2025-05-07T20:33:07.5775275Z         y_fp8, y_scale = fn()
2025-05-07T20:33:07.5775396Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:07.5775467Z 
2025-05-07T20:33:07.5775601Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:07.5775705Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:07.5775803Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:07.5775922Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:07.5776064Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:07.5776182Z 
2025-05-07T20:33:07.5776283Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:07.5776288Z 
2025-05-07T20:33:07.5776392Z moe/activation_test.py:126: 
2025-05-07T20:33:07.5776515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:07.5776625Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:07.5776799Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:07.5777359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:07.5777460Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:07.5777818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:07.5778038Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:07.5778410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:07.5778669Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:07.5779069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:33:07.5779364Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:07.5779736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:07.5779906Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:07.5780243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:07.5780321Z     fn()
2025-05-07T20:33:07.5780716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:07.5780802Z     self.fn.run(
2025-05-07T20:33:07.5781138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:07.5781229Z     kernel = self.compile(
2025-05-07T20:33:07.5781609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:07.5781816Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:07.5781965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:07.5781969Z 
2025-05-07T20:33:07.5782175Z self = 
2025-05-07T20:33:07.5782959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:07.5783516Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917e44160>}
2025-05-07T20:33:07.5784268Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:07.5784460Z context = 
2025-05-07T20:33:07.5784464Z 
2025-05-07T20:33:07.5784633Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:07.5784895Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:07.5785004Z                            module_map=module_map)
2025-05-07T20:33:07.5785165Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.5785265Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:07.5785386Z E       ^
2025-05-07T20:33:07.5785751Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.5785756Z 
2025-05-07T20:33:07.5786170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:07.5786219Z 
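This last example is the notable one: here fn() returned without raising, and the failure moved into the reference path, because triton_quantize_fp8_row also emits an fp8e4nv store. The test's correctness check is row-wise: each output row is stored as fp8 plus one float32 scale, and is dequantized as y_fp8.to(torch.float32) * y_scale[:, None]. A hedged pure-PyTorch sketch of that row-wise quantization; the function name, the epsilon, and the treatment of scale_ub as a clamp on the per-row max are illustrative assumptions, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max maps to the fp8 max.
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            # Assumed semantics: scale_ub caps the per-row max.
            row_max = torch.clamp(row_max, max=scale_ub.item())
        row_max = torch.clamp(row_max, min=1e-12)  # avoid dividing by zero
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

    # Dequantize the way the test does: y ~= y_fp8.to(torch.float32) * y_scale[:, None]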
2025-05-07T20:33:07.5786320Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[identical fp8e4nv CompilationError from _fbgemm_silu_mul_quant, via the torch.compile path]
2025-05-07T20:33:07.5799504Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[identical failure]
2025-05-07T20:33:07.5812220Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[identical failure]
2025-05-07T20:33:07.5825040Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[identical failure]
2025-05-07T20:33:07.5842130Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
[identical failure]
2025-05-07T20:33:07.5854560Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[identical failure]
2025-05-07T20:33:07.5866928Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
[identical failure, ending in:]
2025-05-07T20:33:07.5879026Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.5879128Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.5879204Z E       ^
2025-05-07T20:33:07.5879563Z E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5879571Z 2025-05-07T20:33:07.5880021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5880028Z 2025-05-07T20:33:07.5880125Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5880349Z self=, 2025-05-07T20:33:07.5880421Z T=4096, 2025-05-07T20:33:07.5880492Z D=5120, 2025-05-07T20:33:07.5880572Z scale_ub=None, 2025-05-07T20:33:07.5880651Z contiguous=False, 2025-05-07T20:33:07.5880729Z compiled=True, 2025-05-07T20:33:07.5880801Z ) 2025-05-07T20:33:07.5881020Z self = 2025-05-07T20:33:07.5881196Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.5881201Z 2025-05-07T20:33:07.5881272Z @given( 2025-05-07T20:33:07.5881388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5881493Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5881612Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5881724Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5881836Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5881906Z ) 2025-05-07T20:33:07.5882151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5882242Z def test_silu_mul_quant( 2025-05-07T20:33:07.5882312Z self, 2025-05-07T20:33:07.5882390Z T: int, 2025-05-07T20:33:07.5882463Z D: int, 2025-05-07T20:33:07.5882558Z scale_ub: Optional[float], 2025-05-07T20:33:07.5882647Z contiguous: bool, 2025-05-07T20:33:07.5882777Z compiled: bool, 2025-05-07T20:33:07.5882851Z ) -> None: 2025-05-07T20:33:07.5882943Z torch.manual_seed(2025) 2025-05-07T20:33:07.5883013Z 2025-05-07T20:33:07.5883179Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5883253Z 2025-05-07T20:33:07.5883345Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5883464Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5883550Z x = x_sign * x_clamp 2025-05-07T20:33:07.5883626Z x0 = x[:, :D] 2025-05-07T20:33:07.5883704Z x1 = x[:, D:] 2025-05-07T20:33:07.5883773Z 2025-05-07T20:33:07.5883851Z if contiguous: 2025-05-07T20:33:07.5883940Z x0 = x0.contiguous() 2025-05-07T20:33:07.5884025Z x1 = x1.contiguous() 2025-05-07T20:33:07.5884095Z 2025-05-07T20:33:07.5884185Z if scale_ub is not None: 2025-05-07T20:33:07.5884286Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5884461Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5884538Z ) 2025-05-07T20:33:07.5884612Z else: 2025-05-07T20:33:07.5884702Z scale_ub_tensor = None 2025-05-07T20:33:07.5884778Z 2025-05-07T20:33:07.5884903Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5885037Z op = silu_mul_quant 2025-05-07T20:33:07.5885124Z if compiled: 2025-05-07T20:33:07.5885219Z op = torch.compile(op) 2025-05-07T20:33:07.5885321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5885389Z 2025-05-07T20:33:07.5885475Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5885480Z 2025-05-07T20:33:07.5885576Z moe/activation_test.py:117: 2025-05-07T20:33:07.5885698Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5885795Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5885895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5886275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.5886369Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.5886902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5887003Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5887360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5887583Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5887920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5888019Z kernel = self.compile( 2025-05-07T20:33:07.5888394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5888573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5888695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5888699Z 2025-05-07T20:33:07.5888899Z self = 2025-05-07T20:33:07.5889691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5890194Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917d7e940>} 2025-05-07T20:33:07.5890954Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5891205Z context = 2025-05-07T20:33:07.5891210Z 2025-05-07T20:33:07.5891373Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5891639Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5891743Z module_map=module_map) 2025-05-07T20:33:07.5891903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5892024Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5892125Z E ^ 2025-05-07T20:33:07.5892559Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5892565Z 2025-05-07T20:33:07.5892978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5892983Z 2025-05-07T20:33:07.5893138Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5893358Z self=, 2025-05-07T20:33:07.5893432Z T=4096, 2025-05-07T20:33:07.5893503Z D=5120, 2025-05-07T20:33:07.5893580Z scale_ub=1200.0, 2025-05-07T20:33:07.5893662Z contiguous=False, 2025-05-07T20:33:07.5893790Z compiled=False, 2025-05-07T20:33:07.5893862Z ) 2025-05-07T20:33:07.5894074Z self = 2025-05-07T20:33:07.5894249Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.5894254Z 2025-05-07T20:33:07.5894325Z @given( 2025-05-07T20:33:07.5894443Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5894538Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5894649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5894767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5894885Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5894959Z ) 2025-05-07T20:33:07.5895206Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5895293Z def test_silu_mul_quant( 2025-05-07T20:33:07.5895362Z self, 2025-05-07T20:33:07.5895484Z T: int, 2025-05-07T20:33:07.5895558Z D: int, 2025-05-07T20:33:07.5895654Z scale_ub: Optional[float], 2025-05-07T20:33:07.5895740Z contiguous: bool, 2025-05-07T20:33:07.5895819Z compiled: bool, 2025-05-07T20:33:07.5895896Z ) -> None: 2025-05-07T20:33:07.5895986Z torch.manual_seed(2025) 2025-05-07T20:33:07.5896055Z 2025-05-07T20:33:07.5896221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5896291Z 2025-05-07T20:33:07.5896378Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5896499Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5896592Z x = x_sign * x_clamp 2025-05-07T20:33:07.5896670Z x0 = x[:, :D] 2025-05-07T20:33:07.5896750Z x1 = x[:, D:] 2025-05-07T20:33:07.5896819Z 2025-05-07T20:33:07.5896901Z if contiguous: 2025-05-07T20:33:07.5896989Z x0 = x0.contiguous() 2025-05-07T20:33:07.5897073Z x1 = x1.contiguous() 2025-05-07T20:33:07.5897155Z 2025-05-07T20:33:07.5897243Z if scale_ub is not None: 2025-05-07T20:33:07.5897343Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5897478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5897552Z ) 2025-05-07T20:33:07.5897632Z else: 2025-05-07T20:33:07.5897727Z scale_ub_tensor = None 2025-05-07T20:33:07.5897796Z 2025-05-07T20:33:07.5897924Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5898010Z op = silu_mul_quant 2025-05-07T20:33:07.5898090Z if compiled: 2025-05-07T20:33:07.5898240Z op = torch.compile(op) 2025-05-07T20:33:07.5898342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5898416Z 2025-05-07T20:33:07.5898506Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5898511Z 2025-05-07T20:33:07.5898603Z moe/activation_test.py:117: 2025-05-07T20:33:07.5898734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5898832Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5898927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5899432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5899527Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5899883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5900112Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5900494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5900596Z kernel = self.compile( 2025-05-07T20:33:07.5900974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5901188Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5901313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5901318Z 2025-05-07T20:33:07.5901519Z self = 2025-05-07T20:33:07.5902304Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5902811Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917bae3a0>} 2025-05-07T20:33:07.5903595Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5904003Z context = 2025-05-07T20:33:07.5904010Z 2025-05-07T20:33:07.5904176Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5904438Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5904540Z module_map=module_map) 2025-05-07T20:33:07.5904696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5904794Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5904867Z E ^ 2025-05-07T20:33:07.5905230Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5905241Z 2025-05-07T20:33:07.5905653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5905661Z 2025-05-07T20:33:07.5905762Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5905985Z self=, 2025-05-07T20:33:07.5906059Z T=4096, 2025-05-07T20:33:07.5906127Z D=5120, 2025-05-07T20:33:07.5906217Z scale_ub=1200.0, 2025-05-07T20:33:07.5906299Z contiguous=False, 2025-05-07T20:33:07.5906379Z compiled=True, 2025-05-07T20:33:07.5906453Z ) 2025-05-07T20:33:07.5906669Z self = 2025-05-07T20:33:07.5906844Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:07.5906849Z 2025-05-07T20:33:07.5907022Z @given( 2025-05-07T20:33:07.5907137Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5907235Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5907347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5907460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5907579Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5907649Z ) 2025-05-07T20:33:07.5907893Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5907980Z def test_silu_mul_quant( 2025-05-07T20:33:07.5908052Z self, 2025-05-07T20:33:07.5908128Z T: int, 2025-05-07T20:33:07.5908199Z D: int, 2025-05-07T20:33:07.5908292Z scale_ub: Optional[float], 2025-05-07T20:33:07.5908380Z contiguous: bool, 2025-05-07T20:33:07.5908461Z compiled: bool, 2025-05-07T20:33:07.5908531Z ) -> None: 2025-05-07T20:33:07.5908695Z torch.manual_seed(2025) 2025-05-07T20:33:07.5908770Z 2025-05-07T20:33:07.5908934Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5909009Z 2025-05-07T20:33:07.5909097Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5909215Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5909368Z x = x_sign * x_clamp 2025-05-07T20:33:07.5909449Z x0 = x[:, :D] 2025-05-07T20:33:07.5909527Z x1 = x[:, D:] 2025-05-07T20:33:07.5909596Z 2025-05-07T20:33:07.5909673Z if contiguous: 2025-05-07T20:33:07.5909771Z x0 = x0.contiguous() 2025-05-07T20:33:07.5909920Z x1 = x1.contiguous() 2025-05-07T20:33:07.5909988Z 2025-05-07T20:33:07.5910080Z if scale_ub is not None: 2025-05-07T20:33:07.5910183Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5910312Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5910387Z ) 2025-05-07T20:33:07.5910465Z else: 2025-05-07T20:33:07.5910553Z scale_ub_tensor = None 2025-05-07T20:33:07.5910626Z 2025-05-07T20:33:07.5910752Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5910839Z op = silu_mul_quant 2025-05-07T20:33:07.5910918Z if compiled: 2025-05-07T20:33:07.5911084Z op = torch.compile(op) 2025-05-07T20:33:07.5911190Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5911259Z 2025-05-07T20:33:07.5911346Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5911350Z 2025-05-07T20:33:07.5911447Z moe/activation_test.py:117: 2025-05-07T20:33:07.5911570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5911663Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5911762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5912125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.5912223Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.5912710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5912805Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5913159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5913383Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5913714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5913808Z kernel = self.compile( 2025-05-07T20:33:07.5914180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5914353Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5914477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5914526Z 2025-05-07T20:33:07.5914727Z self = 2025-05-07T20:33:07.5915511Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5916014Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917bae280>} 2025-05-07T20:33:07.5916762Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5916948Z context = 2025-05-07T20:33:07.5916996Z 2025-05-07T20:33:07.5917161Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5917417Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5917520Z module_map=module_map) 2025-05-07T20:33:07.5917721Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5917817Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5917892Z E ^ 2025-05-07T20:33:07.5918255Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5918260Z 2025-05-07T20:33:07.5918669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5918674Z 2025-05-07T20:33:07.5918776Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5918998Z self=, 2025-05-07T20:33:07.5919074Z T=2048, 2025-05-07T20:33:07.5919150Z D=7168, 2025-05-07T20:33:07.5919230Z scale_ub=1200.0, 2025-05-07T20:33:07.5919310Z contiguous=False, 2025-05-07T20:33:07.5919393Z compiled=False, 2025-05-07T20:33:07.5919462Z ) 2025-05-07T20:33:07.5919740Z self = 2025-05-07T20:33:07.5919919Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.5919924Z 2025-05-07T20:33:07.5919994Z @given( 2025-05-07T20:33:07.5920114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5920207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5920318Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5920433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5920542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5920613Z ) 2025-05-07T20:33:07.5920863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5920951Z def test_silu_mul_quant( 2025-05-07T20:33:07.5921027Z self, 2025-05-07T20:33:07.5921100Z T: int, 2025-05-07T20:33:07.5921169Z D: int, 2025-05-07T20:33:07.5921267Z scale_ub: Optional[float], 2025-05-07T20:33:07.5921355Z contiguous: bool, 2025-05-07T20:33:07.5921436Z compiled: bool, 2025-05-07T20:33:07.5921511Z ) -> None: 2025-05-07T20:33:07.5921601Z torch.manual_seed(2025) 2025-05-07T20:33:07.5921670Z 2025-05-07T20:33:07.5921839Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5921910Z 2025-05-07T20:33:07.5921998Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5922118Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5922204Z x = x_sign * x_clamp 2025-05-07T20:33:07.5922281Z x0 = x[:, :D] 2025-05-07T20:33:07.5922361Z x1 = x[:, D:] 2025-05-07T20:33:07.5922489Z 2025-05-07T20:33:07.5922571Z if contiguous: 2025-05-07T20:33:07.5922656Z x0 = x0.contiguous() 2025-05-07T20:33:07.5922741Z x1 = x1.contiguous() 2025-05-07T20:33:07.5922816Z 2025-05-07T20:33:07.5922901Z if scale_ub is not None: 2025-05-07T20:33:07.5923007Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5923138Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5923210Z ) 2025-05-07T20:33:07.5923280Z else: 2025-05-07T20:33:07.5923374Z scale_ub_tensor = None 2025-05-07T20:33:07.5923440Z 2025-05-07T20:33:07.5923565Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5923656Z op = silu_mul_quant 2025-05-07T20:33:07.5923737Z if compiled: 2025-05-07T20:33:07.5923836Z op = torch.compile(op) 2025-05-07T20:33:07.5923938Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5924048Z 2025-05-07T20:33:07.5924139Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5924143Z 2025-05-07T20:33:07.5924235Z moe/activation_test.py:117: 2025-05-07T20:33:07.5924358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5924457Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5924661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5925322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5925447Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5925942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5926249Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5926680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5926783Z kernel = self.compile( 2025-05-07T20:33:07.5927169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5927342Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5927537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5927546Z 2025-05-07T20:33:07.5927751Z self = 2025-05-07T20:33:07.5928530Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5929043Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917b7d670>} 2025-05-07T20:33:07.5929794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5929989Z context = 2025-05-07T20:33:07.5929998Z 2025-05-07T20:33:07.5930160Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5930421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5930529Z module_map=module_map) 2025-05-07T20:33:07.5930690Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5930787Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5930861Z E ^ 2025-05-07T20:33:07.5931222Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5931275Z 2025-05-07T20:33:07.5931691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5931696Z 2025-05-07T20:33:07.5931801Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5932029Z self=, 2025-05-07T20:33:07.5932114Z T=1, 2025-05-07T20:33:07.5932189Z D=7168, 2025-05-07T20:33:07.5932276Z scale_ub=None, 2025-05-07T20:33:07.5932360Z contiguous=True, 2025-05-07T20:33:07.5932441Z compiled=False, 2025-05-07T20:33:07.5932516Z ) 2025-05-07T20:33:07.5932730Z self = 2025-05-07T20:33:07.5932892Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.5932897Z 2025-05-07T20:33:07.5932974Z @given( 2025-05-07T20:33:07.5933091Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5933236Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5933353Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5933468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5933586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5933659Z ) 2025-05-07T20:33:07.5933944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5934043Z def test_silu_mul_quant( 2025-05-07T20:33:07.5934119Z self, 2025-05-07T20:33:07.5934193Z T: int, 2025-05-07T20:33:07.5934272Z D: int, 2025-05-07T20:33:07.5934370Z scale_ub: Optional[float], 2025-05-07T20:33:07.5934457Z contiguous: bool, 2025-05-07T20:33:07.5934547Z compiled: bool, 2025-05-07T20:33:07.5934620Z ) -> None: 2025-05-07T20:33:07.5934716Z torch.manual_seed(2025) 2025-05-07T20:33:07.5934788Z 2025-05-07T20:33:07.5934951Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5935034Z 2025-05-07T20:33:07.5935120Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5935238Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5935326Z x = x_sign * x_clamp 2025-05-07T20:33:07.5935402Z x0 = x[:, :D] 2025-05-07T20:33:07.5935478Z x1 = x[:, D:] 2025-05-07T20:33:07.5935596Z 2025-05-07T20:33:07.5935681Z if contiguous: 2025-05-07T20:33:07.5935768Z x0 = x0.contiguous() 2025-05-07T20:33:07.5935853Z x1 = x1.contiguous() 2025-05-07T20:33:07.5935924Z 2025-05-07T20:33:07.5936011Z if scale_ub is not None: 2025-05-07T20:33:07.5936116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5936247Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5936326Z ) 2025-05-07T20:33:07.5936401Z else: 2025-05-07T20:33:07.5936491Z scale_ub_tensor = None 2025-05-07T20:33:07.5936565Z 2025-05-07T20:33:07.5936698Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5936783Z op = silu_mul_quant 2025-05-07T20:33:07.5936866Z if compiled: 2025-05-07T20:33:07.5936960Z op = torch.compile(op) 2025-05-07T20:33:07.5937061Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5937136Z 2025-05-07T20:33:07.5937224Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5937229Z 2025-05-07T20:33:07.5937326Z moe/activation_test.py:117: 2025-05-07T20:33:07.5937452Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5937549Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5937649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5938206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5938298Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5938656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5938928Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5939266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5939362Z kernel = self.compile( 2025-05-07T20:33:07.5939738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5939912Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5940033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5940037Z 2025-05-07T20:33:07.5940239Z self = 2025-05-07T20:33:07.5941057Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5941565Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89177e0280>} 2025-05-07T20:33:07.5942354Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5942539Z context = 2025-05-07T20:33:07.5942544Z 2025-05-07T20:33:07.5942709Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5942967Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5943072Z module_map=module_map) 2025-05-07T20:33:07.5943240Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5943336Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5943406Z E ^ 2025-05-07T20:33:07.5943762Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5943770Z 2025-05-07T20:33:07.5944224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5944229Z 2025-05-07T20:33:07.5944332Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5944549Z self=, 2025-05-07T20:33:07.5944622Z T=16384, 2025-05-07T20:33:07.5944700Z D=7168, 2025-05-07T20:33:07.5944779Z scale_ub=1200.0, 2025-05-07T20:33:07.5944861Z contiguous=False, 2025-05-07T20:33:07.5944944Z compiled=True, 2025-05-07T20:33:07.5945013Z ) 2025-05-07T20:33:07.5945232Z self = 2025-05-07T20:33:07.5945408Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:07.5945413Z 2025-05-07T20:33:07.5945486Z @given( 2025-05-07T20:33:07.5945603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5945705Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5945814Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5945930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5946037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5946106Z ) 2025-05-07T20:33:07.5946351Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5946441Z def test_silu_mul_quant( 2025-05-07T20:33:07.5946514Z self, 2025-05-07T20:33:07.5946586Z T: int, 2025-05-07T20:33:07.5946655Z D: int, 2025-05-07T20:33:07.5946755Z scale_ub: Optional[float], 2025-05-07T20:33:07.5946886Z contiguous: bool, 2025-05-07T20:33:07.5946968Z compiled: bool, 2025-05-07T20:33:07.5947041Z ) -> None: 2025-05-07T20:33:07.5947133Z torch.manual_seed(2025) 2025-05-07T20:33:07.5947202Z 2025-05-07T20:33:07.5947371Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5947447Z 2025-05-07T20:33:07.5947533Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5947654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5947744Z x = x_sign * x_clamp 2025-05-07T20:33:07.5947821Z x0 = x[:, :D] 2025-05-07T20:33:07.5947897Z x1 = x[:, D:] 2025-05-07T20:33:07.5947964Z 2025-05-07T20:33:07.5948049Z if contiguous: 2025-05-07T20:33:07.5948135Z x0 = x0.contiguous() 2025-05-07T20:33:07.5948221Z x1 = x1.contiguous() 2025-05-07T20:33:07.5948295Z 2025-05-07T20:33:07.5948381Z if scale_ub is not None: 2025-05-07T20:33:07.5948529Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5948667Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5948738Z ) 2025-05-07T20:33:07.5948812Z else: 2025-05-07T20:33:07.5948908Z scale_ub_tensor = None 2025-05-07T20:33:07.5948976Z 2025-05-07T20:33:07.5949181Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5949271Z op = silu_mul_quant 2025-05-07T20:33:07.5949351Z if compiled: 2025-05-07T20:33:07.5949450Z op = torch.compile(op) 2025-05-07T20:33:07.5949551Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5949622Z 2025-05-07T20:33:07.5949712Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5949716Z 2025-05-07T20:33:07.5949896Z moe/activation_test.py:117: 2025-05-07T20:33:07.5950027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5950126Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5950227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5950599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.5950685Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.5951219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5951324Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5951678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5951900Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5952237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5952326Z kernel = self.compile( 2025-05-07T20:33:07.5952709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5952881Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5953002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5953006Z 2025-05-07T20:33:07.5953216Z self = 2025-05-07T20:33:07.5954001Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5954509Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89177e0ee0>} 2025-05-07T20:33:07.5955254Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5955485Z context = 2025-05-07T20:33:07.5955493Z 2025-05-07T20:33:07.5955652Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5955916Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5956020Z module_map=module_map) 2025-05-07T20:33:07.5956179Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5956275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5956353Z E ^ 2025-05-07T20:33:07.5956713Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5956718Z 2025-05-07T20:33:07.5957165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5957173Z 2025-05-07T20:33:07.5957271Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5957490Z self=, 2025-05-07T20:33:07.5957569Z T=1, 2025-05-07T20:33:07.5957643Z D=7168, 2025-05-07T20:33:07.5957763Z scale_ub=None, 2025-05-07T20:33:07.5957853Z contiguous=False, 2025-05-07T20:33:07.5957937Z compiled=False, 2025-05-07T20:33:07.5958006Z ) 2025-05-07T20:33:07.5958221Z self = 2025-05-07T20:33:07.5958385Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.5958390Z 2025-05-07T20:33:07.5958473Z @given( 2025-05-07T20:33:07.5958589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5958686Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5958800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5963456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5963592Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5963666Z ) 2025-05-07T20:33:07.5963918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5964012Z def test_silu_mul_quant( 2025-05-07T20:33:07.5964155Z self, 2025-05-07T20:33:07.5964232Z T: int, 2025-05-07T20:33:07.5964304Z D: int, 2025-05-07T20:33:07.5964398Z scale_ub: Optional[float], 2025-05-07T20:33:07.5964483Z contiguous: bool, 2025-05-07T20:33:07.5964566Z compiled: bool, 2025-05-07T20:33:07.5964647Z ) -> None: 2025-05-07T20:33:07.5964737Z torch.manual_seed(2025) 2025-05-07T20:33:07.5964811Z 2025-05-07T20:33:07.5964985Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5965057Z 2025-05-07T20:33:07.5965152Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5965277Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5965364Z x = x_sign * x_clamp 2025-05-07T20:33:07.5965443Z x0 = x[:, :D] 2025-05-07T20:33:07.5965517Z x1 = x[:, D:] 2025-05-07T20:33:07.5965589Z 2025-05-07T20:33:07.5965673Z if contiguous: 2025-05-07T20:33:07.5965760Z x0 = x0.contiguous() 2025-05-07T20:33:07.5965852Z x1 = x1.contiguous() 2025-05-07T20:33:07.5965923Z 2025-05-07T20:33:07.5966010Z if scale_ub is not None: 2025-05-07T20:33:07.5966111Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5966245Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5966318Z ) 2025-05-07T20:33:07.5966392Z else: 2025-05-07T20:33:07.5966484Z scale_ub_tensor = None 2025-05-07T20:33:07.5966552Z 2025-05-07T20:33:07.5966683Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5966771Z op = silu_mul_quant 2025-05-07T20:33:07.5966904Z if compiled: 2025-05-07T20:33:07.5967007Z op = torch.compile(op) 2025-05-07T20:33:07.5967112Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5967181Z 2025-05-07T20:33:07.5967270Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5967275Z 2025-05-07T20:33:07.5967376Z moe/activation_test.py:117: 2025-05-07T20:33:07.5967508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5967607Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5967701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5968207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5968298Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5968651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5968920Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5969261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5969350Z kernel = self.compile( 2025-05-07T20:33:07.5969736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5969948Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5970074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5970079Z 2025-05-07T20:33:07.5970282Z self = 2025-05-07T20:33:07.5971063Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5971576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917abd670>} 2025-05-07T20:33:07.5972355Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5972551Z context = 2025-05-07T20:33:07.5972556Z 2025-05-07T20:33:07.5972716Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5972978Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5973081Z module_map=module_map) 2025-05-07T20:33:07.5973239Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5973334Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5973414Z E ^ 2025-05-07T20:33:07.5973767Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5973772Z 2025-05-07T20:33:07.5974183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5974193Z 2025-05-07T20:33:07.5974290Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5974511Z self=, 2025-05-07T20:33:07.5974584Z T=2048, 2025-05-07T20:33:07.5974653Z D=7168, 2025-05-07T20:33:07.5974735Z scale_ub=None, 2025-05-07T20:33:07.5974816Z contiguous=False, 2025-05-07T20:33:07.5974893Z compiled=True, 2025-05-07T20:33:07.5974967Z ) 2025-05-07T20:33:07.5975179Z self = 2025-05-07T20:33:07.5975352Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.5975406Z 2025-05-07T20:33:07.5975478Z @given( 2025-05-07T20:33:07.5975595Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5975695Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5975807Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5975927Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5976042Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5976114Z ) 2025-05-07T20:33:07.5976356Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5976445Z def test_silu_mul_quant( 2025-05-07T20:33:07.5976519Z self, 2025-05-07T20:33:07.5976594Z T: int, 2025-05-07T20:33:07.5976664Z D: int, 2025-05-07T20:33:07.5976761Z scale_ub: Optional[float], 2025-05-07T20:33:07.5976849Z contiguous: bool, 2025-05-07T20:33:07.5976929Z compiled: bool, 2025-05-07T20:33:07.5977004Z ) -> None: 2025-05-07T20:33:07.5977142Z torch.manual_seed(2025) 2025-05-07T20:33:07.5977213Z 2025-05-07T20:33:07.5977378Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5977451Z 2025-05-07T20:33:07.5977539Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5977661Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5977791Z x = x_sign * x_clamp 2025-05-07T20:33:07.5977867Z x0 = x[:, :D] 2025-05-07T20:33:07.5977941Z x1 = x[:, D:] 2025-05-07T20:33:07.5978017Z 2025-05-07T20:33:07.5978096Z if contiguous: 2025-05-07T20:33:07.5978190Z x0 = x0.contiguous() 2025-05-07T20:33:07.5978275Z x1 = x1.contiguous() 2025-05-07T20:33:07.5978346Z 2025-05-07T20:33:07.5978437Z if scale_ub is not None: 2025-05-07T20:33:07.5978537Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5978666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5978744Z ) 2025-05-07T20:33:07.5978817Z else: 2025-05-07T20:33:07.5978905Z scale_ub_tensor = None 2025-05-07T20:33:07.5978976Z 2025-05-07T20:33:07.5979103Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5979187Z op = silu_mul_quant 2025-05-07T20:33:07.5979316Z if compiled: 2025-05-07T20:33:07.5979415Z op = torch.compile(op) 2025-05-07T20:33:07.5979521Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5979587Z 2025-05-07T20:33:07.5979673Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5979678Z 2025-05-07T20:33:07.5979776Z moe/activation_test.py:117: 2025-05-07T20:33:07.5979900Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5979995Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5980092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5980458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.5980554Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.5981050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5981143Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5981504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5981725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5982059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5982149Z kernel = self.compile( 2025-05-07T20:33:07.5982526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5982702Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5982871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5982875Z 2025-05-07T20:33:07.5983078Z self = 2025-05-07T20:33:07.5983870Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5984377Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917658550>} 2025-05-07T20:33:07.5985130Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5985387Z context = 2025-05-07T20:33:07.5985395Z 2025-05-07T20:33:07.5985559Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5985822Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5985928Z module_map=module_map) 2025-05-07T20:33:07.5986131Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5986228Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5986304Z E ^ 2025-05-07T20:33:07.5986660Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5986665Z 2025-05-07T20:33:07.5987073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5987078Z 2025-05-07T20:33:07.5987180Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5987401Z self=, 2025-05-07T20:33:07.5987474Z T=4096, 2025-05-07T20:33:07.5987552Z D=7168, 2025-05-07T20:33:07.5987629Z scale_ub=None, 2025-05-07T20:33:07.5987713Z contiguous=False, 2025-05-07T20:33:07.5987794Z compiled=True, 2025-05-07T20:33:07.5987861Z ) 2025-05-07T20:33:07.5988121Z self = 2025-05-07T20:33:07.5988298Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.5988303Z 2025-05-07T20:33:07.5988375Z @given( 2025-05-07T20:33:07.5988493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5988587Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5988697Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5988814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5988922Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5988998Z ) 2025-05-07T20:33:07.5989243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5989332Z def test_silu_mul_quant( 2025-05-07T20:33:07.5989404Z self, 2025-05-07T20:33:07.5989482Z T: int, 2025-05-07T20:33:07.5989555Z D: int, 2025-05-07T20:33:07.5989653Z scale_ub: Optional[float], 2025-05-07T20:33:07.5989741Z contiguous: bool, 2025-05-07T20:33:07.5989910Z compiled: bool, 2025-05-07T20:33:07.5990003Z ) -> None: 2025-05-07T20:33:07.5990093Z torch.manual_seed(2025) 2025-05-07T20:33:07.5990161Z 2025-05-07T20:33:07.5990337Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5990409Z 2025-05-07T20:33:07.5990496Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5990617Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5990700Z x = x_sign * x_clamp 2025-05-07T20:33:07.5990778Z x0 = x[:, :D] 2025-05-07T20:33:07.5990913Z x1 = x[:, D:] 2025-05-07T20:33:07.5990981Z 2025-05-07T20:33:07.5991061Z if contiguous: 2025-05-07T20:33:07.5991150Z x0 = x0.contiguous() 2025-05-07T20:33:07.5991240Z x1 = x1.contiguous() 2025-05-07T20:33:07.5991315Z 2025-05-07T20:33:07.5991404Z if scale_ub is not None: 2025-05-07T20:33:07.5991511Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5991644Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5991716Z ) 2025-05-07T20:33:07.5991790Z else: 2025-05-07T20:33:07.5991884Z scale_ub_tensor = None 2025-05-07T20:33:07.5991955Z 2025-05-07T20:33:07.5992079Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5992167Z op = silu_mul_quant 2025-05-07T20:33:07.5992247Z if compiled: 2025-05-07T20:33:07.5992341Z op = torch.compile(op) 2025-05-07T20:33:07.5992495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5992568Z 2025-05-07T20:33:07.5992660Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5992664Z 2025-05-07T20:33:07.5992755Z moe/activation_test.py:117: 2025-05-07T20:33:07.5992921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5993120Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5993242Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5993620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.5993712Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.5994203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5994309Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5994666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5994899Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5995240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5995330Z kernel = self.compile( 2025-05-07T20:33:07.5995764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5995942Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5996065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5996070Z 2025-05-07T20:33:07.5996276Z self = 2025-05-07T20:33:07.5997067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5997578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891777b160>} 2025-05-07T20:33:07.5998338Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5998528Z context = 2025-05-07T20:33:07.5998533Z 2025-05-07T20:33:07.5998700Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5998957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5999066Z module_map=module_map) 2025-05-07T20:33:07.5999223Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5999366Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5999445Z E ^ 2025-05-07T20:33:07.5999798Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5999802Z 2025-05-07T20:33:07.6000216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.6000223Z 2025-05-07T20:33:07.6000325Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6000541Z self=, 2025-05-07T20:33:07.6000618Z T=16384, 2025-05-07T20:33:07.6000700Z D=5120, 2025-05-07T20:33:07.6000784Z scale_ub=1200.0, 2025-05-07T20:33:07.6000866Z contiguous=False, 2025-05-07T20:33:07.6000949Z compiled=False, 2025-05-07T20:33:07.6001018Z ) 2025-05-07T20:33:07.6001239Z self = 2025-05-07T20:33:07.6001468Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.6001477Z 2025-05-07T20:33:07.6001549Z @given( 2025-05-07T20:33:07.6001669Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6001762Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6001880Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6002038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6002149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6002223Z ) 2025-05-07T20:33:07.6002468Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6002564Z def test_silu_mul_quant( 2025-05-07T20:33:07.6002651Z self, 2025-05-07T20:33:07.6002725Z T: int, 2025-05-07T20:33:07.6002805Z D: int, 2025-05-07T20:33:07.6002905Z scale_ub: Optional[float], 2025-05-07T20:33:07.6002991Z contiguous: bool, 2025-05-07T20:33:07.6003074Z compiled: bool, 2025-05-07T20:33:07.6003163Z ) -> None: 2025-05-07T20:33:07.6003253Z torch.manual_seed(2025) 2025-05-07T20:33:07.6003322Z 2025-05-07T20:33:07.6003485Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6003556Z 2025-05-07T20:33:07.6003648Z x_sign = torch.sign(x) 2025-05-07T20:33:07.6004156Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.6004279Z x = x_sign * x_clamp 2025-05-07T20:33:07.6004385Z x0 = x[:, :D] 2025-05-07T20:33:07.6004489Z x1 = x[:, D:] 2025-05-07T20:33:07.6004583Z 2025-05-07T20:33:07.6004669Z if contiguous: 2025-05-07T20:33:07.6004754Z x0 = x0.contiguous() 2025-05-07T20:33:07.6004836Z x1 = x1.contiguous() 2025-05-07T20:33:07.6004907Z 2025-05-07T20:33:07.6004995Z if scale_ub is not None: 2025-05-07T20:33:07.6005101Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.6005236Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.6005312Z ) 2025-05-07T20:33:07.6005390Z else: 2025-05-07T20:33:07.6005479Z scale_ub_tensor = None 2025-05-07T20:33:07.6005544Z 2025-05-07T20:33:07.6005673Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.6005762Z op = silu_mul_quant 2025-05-07T20:33:07.6005841Z if compiled: 2025-05-07T20:33:07.6005937Z op = torch.compile(op) 2025-05-07T20:33:07.6006037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.6006108Z 2025-05-07T20:33:07.6006200Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.6006204Z 2025-05-07T20:33:07.6006298Z moe/activation_test.py:117: 2025-05-07T20:33:07.6006426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.6006523Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.6006614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.6007195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
Hypothesis went on to try ten more examples, and every one failed with the same CompilationError from the same _fbgemm_silu_mul_quant launch; the source listing and traceback repeat verbatim for each (with an extra torch/_dynamo/eval_frame.py:678 frame in _fn whenever compiled=True):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
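The failure is independent of FBGEMM and of Hypothesis: any Triton kernel that materializes an fp8e4nv value trips the same architecture check during ast_to_ttir. A standalone repro sketch, assuming a CUDA build of PyTorch with float8 dtypes and a recent Triton; the kernel, names, and sizes are made up for illustration:

    # Hypothetical minimal repro: casting to tl.float8e4nv inside any
    # @triton.jit kernel raises the same CompilationError on SM < 8.9.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        # This cast is what make_ir rejects on unsupported architectures.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    N = 1024
    x = torch.randn(N, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(N, device="cuda", dtype=torch.float8_e4m3fn)
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...").
    _cast_fp8[(triton.cdiv(N, 256),)](x, y, N, BLOCK=256)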
2025-05-07T20:33:07.6141370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.6141471Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.6141838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.6142100Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.6142502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.6142597Z kernel = self.compile( 2025-05-07T20:33:07.6142975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.6143154Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.6143277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.6143281Z 2025-05-07T20:33:07.6143483Z self = 2025-05-07T20:33:07.6144274Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.6144821Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917394dc0>} 2025-05-07T20:33:07.6145571Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.6145766Z context = 2025-05-07T20:33:07.6145771Z 2025-05-07T20:33:07.6145936Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.6146203Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.6146313Z module_map=module_map) 2025-05-07T20:33:07.6146478Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.6146580Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.6146655Z E ^ 2025-05-07T20:33:07.6147014Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tries further examples, each failing in one of the same two ways:

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> same CompilationError at moe/activation_test.py:117 (silu_mul_quant -> _fbgemm_silu_mul_quant[grid]): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
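Every CompilationError in this run is the same architecture mismatch rather than a problem with the test inputs: Triton only accepts the fp8e4nv (FP8 E4M3) element type on GPUs of compute capability 8.9 or newer, and the 22.07 GiB device reported in the OOM messages is consistent with an older sm_86-class part (e.g. an A10G), where only the fp8e4b15 and fp8e5 encodings named in the error are available. A minimal sketch of a guard that would skip such cases up front (the helper and test names are illustrative, not from the test file):

    import unittest

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton's fp8e4nv requires compute capability
        # >= 8.9 (Ada/Hopper); sm_86 and older raise the ValueError seen above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not cuda_supports_fp8e4nv(), "fp8e4nv needs sm_89 or newer")
    def test_silu_mul_quant_fp8_guarded() -> None:
        ...  # body unchanged; runs only where the Triton kernel can compile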
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): CUDA out of memory. Tried to allocate 112.00 MiB; 28.44 MiB of 22.07 GiB free; process using 22.03 GiB (21.61 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): CUDA out of memory. Tried to allocate 448.00 MiB; 140.44 MiB of 22.07 GiB free; process using 21.92 GiB (21.50 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): CUDA out of memory. Tried to allocate 56.00 MiB; 28.44 MiB of 22.07 GiB free; process using 22.03 GiB (21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): CUDA out of memory. Tried to allocate 56.00 MiB; 28.44 MiB of 22.07 GiB free; process using 22.03 GiB (21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated).
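The OutOfMemoryError examples are a separate failure mode: each example allocates a fresh [T, 2*D] bfloat16 input plus same-sized temporaries for torch.sign, torch.abs and torch.clamp, and with many hypothesis examples executed in one process the 22.07 GiB device fills up until even a 56 MiB request fails. A minimal sketch of releasing cached blocks between examples, assuming the pressure comes from accumulated allocator cache rather than live references (the helper name is illustrative):

    import gc

    import torch

    def release_cuda_memory() -> None:
        gc.collect()               # drop dead Python references to old tensors
        torch.cuda.empty_cache()   # return cached, unused blocks to the driver

    # e.g. call release_cuda_memory() at the top of the test body so every
    # hypothesis example starts from a cleaner allocator state.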
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn): CUDA out of memory. Tried to allocate 56.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.69 GiB allocated by PyTorch, 59.18 MiB reserved but unallocated).
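Note that the compiled flag is irrelevant to the CompilationError: the T=1 and T=128 examples above fail with compiled=False because silu_mul_quant launches the _fbgemm_silu_mul_quant Triton kernel directly (activation.py:80), so the kernel is JIT-compiled on first call either way; torch.compile only changes how the Python wrapper is traced. To replay one failing case without rerunning the whole search, hypothesis can pin it with @example; a sketch with a stand-in body, using a parameter value copied from the log:

    from hypothesis import example, given, settings, strategies as st

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(deadline=None)
    @example(T=1)  # known-failing case from this log, replayed first
    def test_silu_mul_quant_repro(T: int) -> None:
        assert T >= 1  # stand-in; the real test would call silu_mul_quant here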
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign): CUDA out of memory. Tried to allocate 40.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn): CUDA out of memory. Tried to allocate 320.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn): CUDA out of memory. Tried to allocate 80.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn): CUDA out of memory. Tried to allocate 40.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn): CUDA out of memory. Tried to allocate 112.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn): CUDA out of memory. Tried to allocate 40.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.6292046Z 2025-05-07T20:33:07.6292163Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.6292168Z 2025-05-07T20:33:07.6292269Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6292492Z self=, 2025-05-07T20:33:07.6292605Z T=4096, 2025-05-07T20:33:07.6292681Z D=7168, 2025-05-07T20:33:07.6292759Z scale_ub=1200.0, 2025-05-07T20:33:07.6292840Z contiguous=True, 2025-05-07T20:33:07.6292925Z compiled=False, 2025-05-07T20:33:07.6292995Z ) 2025-05-07T20:33:07.6293212Z self = 2025-05-07T20:33:07.6293381Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.6293386Z 2025-05-07T20:33:07.6293461Z @given( 2025-05-07T20:33:07.6293579Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6293682Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6293793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6293910Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6294020Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6294093Z ) 2025-05-07T20:33:07.6294384Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6294478Z def test_silu_mul_quant( 2025-05-07T20:33:07.6294551Z self, 2025-05-07T20:33:07.6294630Z T: int, 2025-05-07T20:33:07.6294702Z D: int, 2025-05-07T20:33:07.6294795Z scale_ub: Optional[float], 2025-05-07T20:33:07.6294884Z contiguous: bool, 2025-05-07T20:33:07.6294966Z compiled: bool, 2025-05-07T20:33:07.6295044Z ) -> None: 2025-05-07T20:33:07.6295135Z torch.manual_seed(2025) 2025-05-07T20:33:07.6295204Z 2025-05-07T20:33:07.6295367Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6297157Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.6297165Z 2025-05-07T20:33:07.6297284Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.6297289Z 2025-05-07T20:33:07.6297387Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6297609Z self=, 2025-05-07T20:33:07.6297688Z T=16384, 2025-05-07T20:33:07.6297760Z D=7168, 2025-05-07T20:33:07.6297883Z scale_ub=None, 2025-05-07T20:33:07.6297968Z contiguous=False, 2025-05-07T20:33:07.6298047Z compiled=True, 2025-05-07T20:33:07.6298122Z ) 2025-05-07T20:33:07.6298341Z self = 2025-05-07T20:33:07.6298516Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.6298523Z 2025-05-07T20:33:07.6298599Z @given( 2025-05-07T20:33:07.6298716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6298809Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6298922Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6299034Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6299144Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6299219Z ) 2025-05-07T20:33:07.6299461Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6299620Z def test_silu_mul_quant( 2025-05-07T20:33:07.6299698Z self, 2025-05-07T20:33:07.6299775Z T: int, 2025-05-07T20:33:07.6299854Z D: int, 2025-05-07T20:33:07.6299948Z scale_ub: Optional[float], 2025-05-07T20:33:07.6300033Z contiguous: bool, 2025-05-07T20:33:07.6300118Z compiled: bool, 2025-05-07T20:33:07.6300233Z ) -> None: 2025-05-07T20:33:07.6300323Z torch.manual_seed(2025) 2025-05-07T20:33:07.6300396Z 2025-05-07T20:33:07.6300560Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6302392Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.6302400Z 2025-05-07T20:33:07.6302512Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.6302517Z 2025-05-07T20:33:07.6302617Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6302882Z self=, 2025-05-07T20:33:07.6302958Z T=4096, 2025-05-07T20:33:07.6303034Z D=7168, 2025-05-07T20:33:07.6303114Z scale_ub=None, 2025-05-07T20:33:07.6303195Z contiguous=True, 2025-05-07T20:33:07.6303279Z compiled=False, 2025-05-07T20:33:07.6303349Z ) 2025-05-07T20:33:07.6303566Z self = 2025-05-07T20:33:07.6304052Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.6304060Z 2025-05-07T20:33:07.6304154Z @given( 2025-05-07T20:33:07.6304277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6304370Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6304479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6304594Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6304701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6304773Z ) 2025-05-07T20:33:07.6305019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6305107Z def test_silu_mul_quant( 2025-05-07T20:33:07.6305179Z self, 2025-05-07T20:33:07.6305252Z T: int, 2025-05-07T20:33:07.6305320Z D: int, 2025-05-07T20:33:07.6305414Z scale_ub: Optional[float], 2025-05-07T20:33:07.6305503Z contiguous: bool, 2025-05-07T20:33:07.6305583Z compiled: bool, 2025-05-07T20:33:07.6305659Z ) -> None: 2025-05-07T20:33:07.6305745Z torch.manual_seed(2025) 2025-05-07T20:33:07.6305814Z 2025-05-07T20:33:07.6306076Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6307874Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.6307883Z 2025-05-07T20:33:07.6307996Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.6308001Z 2025-05-07T20:33:07.6308096Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6308317Z self=, 2025-05-07T20:33:07.6308461Z T=16384, 2025-05-07T20:33:07.6308535Z D=7168, 2025-05-07T20:33:07.6308611Z scale_ub=None, 2025-05-07T20:33:07.6308692Z contiguous=True, 2025-05-07T20:33:07.6308772Z compiled=False, 2025-05-07T20:33:07.6308844Z ) 2025-05-07T20:33:07.6309056Z self = 2025-05-07T20:33:07.6309298Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.6309302Z 2025-05-07T20:33:07.6309381Z @given( 2025-05-07T20:33:07.6309494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6309588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6309703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6309878Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6309986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6310060Z ) 2025-05-07T20:33:07.6310309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6310406Z def test_silu_mul_quant( 2025-05-07T20:33:07.6310478Z self, 2025-05-07T20:33:07.6310550Z T: int, 2025-05-07T20:33:07.6310623Z D: int, 2025-05-07T20:33:07.6310717Z scale_ub: Optional[float], 2025-05-07T20:33:07.6310798Z contiguous: bool, 2025-05-07T20:33:07.6310955Z compiled: bool, 2025-05-07T20:33:07.6311030Z ) -> None: 2025-05-07T20:33:07.6311120Z torch.manual_seed(2025) 2025-05-07T20:33:07.6311191Z 2025-05-07T20:33:07.6311355Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6313163Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.6313172Z 2025-05-07T20:33:07.6313284Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.6313292Z 2025-05-07T20:33:07.6313395Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6313614Z self=, 2025-05-07T20:33:07.6313686Z T=16384, 2025-05-07T20:33:07.6313758Z D=7168, 2025-05-07T20:33:07.6313833Z scale_ub=1200.0, 2025-05-07T20:33:07.6313913Z contiguous=True, 2025-05-07T20:33:07.6313999Z compiled=False, 2025-05-07T20:33:07.6314065Z ) 2025-05-07T20:33:07.6314279Z self = 2025-05-07T20:33:07.6314453Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.6314504Z 2025-05-07T20:33:07.6314580Z @given( 2025-05-07T20:33:07.6314697Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6314791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6314902Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6315020Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6315131Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6315200Z ) 2025-05-07T20:33:07.6315445Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6315535Z def test_silu_mul_quant( 2025-05-07T20:33:07.6315604Z self, 2025-05-07T20:33:07.6315677Z T: int, 2025-05-07T20:33:07.6315754Z D: int, 2025-05-07T20:33:07.6315848Z scale_ub: Optional[float], 2025-05-07T20:33:07.6315937Z contiguous: bool, 2025-05-07T20:33:07.6316015Z compiled: bool, 2025-05-07T20:33:07.6316092Z ) -> None: 2025-05-07T20:33:07.6316226Z torch.manual_seed(2025) 2025-05-07T20:33:07.6316295Z 2025-05-07T20:33:07.6316461Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6318260Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
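For the larger examples above, the reported allocation sizes are exactly the size of the bf16 input tensor x of shape [T, 2 * D] at 2 bytes per element, so no single request is unreasonable; the failures come from the roughly 21.7 GiB the process is already holding. A quick arithmetic check (a standalone sketch, not part of the test suite):

# The "Tried to allocate" sizes above match the bf16 tensor
# x = torch.randn([T, 2 * D]): T * (2 * D) elements * 2 bytes each.
for T, D, reported in [(2048, 5120, 40), (4096, 5120, 80),
                       (4096, 7168, 112), (16384, 7168, 448)]:
    mib = T * (2 * D) * 2 / (1 << 20)  # bytes -> MiB
    print(f"T={T:5d} D={D}: computed {mib:6.2f} MiB, log reported {reported}.00 MiB")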
Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8916e4cca0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
(test source as above)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
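The CompilationError above is an architecture limitation rather than a test bug: Triton's fp8e4nv type is FP8 E4M3, which Triton only supports on GPUs with compute capability 8.9 or newer, while the g5.4xlarge runner's A10G reports capability 8.6 and therefore only offers fp8e4b15 and fp8e5. A minimal sketch of a capability gate, assuming a hypothetical helper and test-class name (this is not FBGEMM's actual guard):

import unittest
import torch

def supports_fp8_e4m3() -> bool:
    # Triton's fp8e4nv maps to FP8 E4M3, available on SM 8.9+ (Ada/Hopper).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: skip quantization tests on older parts such as the A10G (SM 8.6).
@unittest.skipIf(not supports_fp8_e4m3(), "FP8 E4M3 not supported on this GPU")
class SiluMulQuantTests(unittest.TestCase):
    ...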
Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
(test source as above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(Triton JIT and compiler frames identical to the traceback above)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
(test source as above; this example got past the allocation at line 92 and ran out of memory one step later)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
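By this point only 4.44 MiB of the 22.07 GiB device is free and even 20.00 MiB requests fail, with free memory shrinking as examples run; that pattern points at tensors surviving from one generated example to the next. A common mitigation, sketched under the assumption that cleanup at the top of the test body is acceptable (the helper name is hypothetical; Hypothesis re-invokes the decorated function once per example, so a call there runs every time):

import gc
import torch

def reclaim_cuda_memory() -> None:
    gc.collect()                  # drop dead Python references to tensors
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # let queued kernels finish first
        torch.cuda.empty_cache()  # hand cached allocator blocks back to the driver

# Usage sketch: call reclaim_cuda_memory() as the first statement of
# test_silu_mul_quant, before the torch.randn allocation at line 92.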
Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
(test source as above)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
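The error message's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, targets fragmentation and has to be in place before the process makes its first CUDA allocation, so in CI it normally belongs in the workflow environment rather than in test code. Note that here only a few MiB are reserved-but-unallocated, so fragmentation is minor and this setting alone would likely not rescue the run. A sketch of setting it from Python anyway:

# Must run before the first CUDA allocation so the caching allocator sees it.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after setting the variable, to be safe
x = torch.zeros(1, device="cuda")  # first allocation now uses expandable segments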
Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
(test source as above)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated.
See " 2025-05-07T20:33:07.6371488Z 2025-05-07T20:33:07.6371700Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:07.6371883Z ================= 1 failed, 1 deselected, 3 warnings in 19.25s ================= 2025-05-07T20:33:09.1173757Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:09.1806237Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:09.1806503Z 2025-05-07T20:33:09.1806673Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:09.1807238Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:09.1807629Z 2025-05-07T20:33:09.1807633Z 2025-05-07T20:33:09.1807637Z 2025-05-07T20:33:09.1824996Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:09.1905289Z Post job cleanup. 2025-05-07T20:33:09.2887495Z [command]/usr/bin/git version 2025-05-07T20:33:09.2932511Z git version 2.47.1 2025-05-07T20:33:09.2971276Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/0437cfae-c772-4cbd-8dab-3158a79dbfad/.gitconfig' 2025-05-07T20:33:09.2982176Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/0437cfae-c772-4cbd-8dab-3158a79dbfad' before making global git config changes 2025-05-07T20:33:09.2983039Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:09.2987863Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:09.3033971Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:09.3068298Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:09.3403641Z Entering 'external/asmjit' 2025-05-07T20:33:09.3471469Z Entering 'external/composable_kernel' 2025-05-07T20:33:09.3545552Z Entering 'external/cpuinfo' 2025-05-07T20:33:09.3612149Z Entering 'external/cutlass' 2025-05-07T20:33:09.3686965Z Entering 'external/googletest' 2025-05-07T20:33:09.3752736Z Entering 'external/hipify_torch' 2025-05-07T20:33:09.3819357Z Entering 'external/json' 2025-05-07T20:33:09.3905330Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:09.3933286Z http.https://github.com/.extraheader 2025-05-07T20:33:09.3945447Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:09.3979975Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:09.4309195Z Entering 'external/asmjit' 2025-05-07T20:33:09.4353045Z http.https://github.com/.extraheader 2025-05-07T20:33:09.4395621Z Entering 'external/composable_kernel' 2025-05-07T20:33:09.4438595Z http.https://github.com/.extraheader 2025-05-07T20:33:09.4487744Z Entering 'external/cpuinfo' 2025-05-07T20:33:09.4531439Z http.https://github.com/.extraheader 2025-05-07T20:33:09.4575796Z Entering 'external/cutlass' 2025-05-07T20:33:09.4618746Z http.https://github.com/.extraheader 2025-05-07T20:33:09.4669899Z 
Entering 'external/googletest' 2025-05-07T20:33:09.4717460Z http.https://github.com/.extraheader 2025-05-07T20:33:09.4760238Z Entering 'external/hipify_torch' 2025-05-07T20:33:09.4802236Z http.https://github.com/.extraheader 2025-05-07T20:33:09.4844573Z Entering 'external/json' 2025-05-07T20:33:09.4891277Z http.https://github.com/.extraheader 2025-05-07T20:33:09.5041005Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:09.5071536Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:09.5081922Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:09.5082308Z ##[endgroup] 2025-05-07T20:33:09.5201906Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:20.2975340Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:36.6410716Z Cleaning up orphan processes
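The harness's retry wrapper reran the suite with pytest's last-failed selection, as shown in the conda command logged above. To reproduce the same selection locally, a sketch using pytest's Python entry point (equivalent to the logged command line, minus conda; run from the test directory):

import pytest

pytest.main([
    "-v", "-rsx", "-s",
    "-W", "ignore::pytest.PytestCollectionWarning",
    "--lf", "--last-failed-no-failures", "none",  # rerun only previously failed tests
    "./moe/activation_test.py",
])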